between the state-of-the-art, a new experimental protocol is presented, with training sets containing 10k, 100k
and 1M images and an evaluation on three test sets, contributed by various research groups. Eleven
representative works are implemented and evaluated. Putting all this together, the survey aims to provide an
overview of the past and foster progress for the near future.
1. INTRODUCTION

Images want to be shared. Be it a drawing carved in rock, a painting exposed in a 
museum, or a photo capturing a special moment, it is the sharing that relives the 
experience stored in the image. Nowadays, several technological developments have 
spurred the sharing of images in unprecedented volumes. The first is the ease with 
which images can be captured in a digital format by cameras, cellphones and other 
wearable sensory devices. The second is the Internet that allows transfer of digital
image content to anyone, anywhere in the world. Finally, and most recently, the sharing
of digital imagery has reached new heights by the massive adoption of social network
platforms. All of a sudden images come with tags. Tagging, commenting, and rating
of any digital image has become a common habit. As a result, we observe a downpour
of personally annotated user-generated visual content and associated metadata.
The problem of image retrieval has thus been extended with the problem of searching images
generated within social platforms and improving social media annotations in order to
permit effective retrieval.

Excellent surveys on content-based image retrieval have been published in the past. 
In their seminal work, Smeulders et al. review the early years up to the year 2000 by 
focusing on what can be seen in an image and introducing the main scientific problem 
of the field: the semantic gap as “the lack of coincidence between the information that 
one can extract from the visual data and the interpretation that the same data have 
for a user in a given situation” [Smeulders et al. 2000]. Datta et al. continue along this
line and describe the coming-of-age of the field, highlighting the key theoretical and
empirical contributions of recent years [Datta et al. 2008]. These reviews completely
ignore social platforms and socially generated images, which is not surprising as the 
phenomenon only became apparent after these reviews were published. 

In this paper, we survey the state-of-the-art of content-based image retrieval in the 
context of social image platforms and tagging, with a comprehensive treatise of the 
closely linked problems of image tag assignment, image tag refinement and tag-based
image retrieval. Similar to [Smeulders et al. 2000] and [Datta et al. 2008], the focus of
our survey is on visual information, but we explicitly take into account and quantify 
the value of social tagging. 


1.1. Problems and Tasks 

Social image tags are provided by common users. They often cannot meet high quality
standards related to content association, in particular for accurately describing objective
aspects of the visual content according to some expert’s opinion [Dodge et al. 2012].
Social tags tend to follow context, trends and events in the real world. They are often
used to describe both the situation and the entity represented in the visual content. In
such a context there are distinct problems to solve. On the one hand, social tags tend to
be imprecise, ambiguous and incomplete. On the other hand, they are biased towards
personal perspectives. So tagging deviations due to spatial and temporal correlation to
external factors are common phenomena [Golder and Huberman 2006; Sen et al. 2006;
Sigurbjörnsson and Van Zwol 2008; Kennedy et al. 2006]. The focus of interests and
motivations of an image retriever could be different from those of an image uploader.

Quite a few researchers have proposed solutions for image annotation and retrieval 
in social frameworks, although the peculiarities of this domain have been only partially
addressed. Concerning the role of visual content in social image tagging, several
studies have shown that people are willing to tag objects and scenes presented in the
visual content to favor image retrieval for a general audience [Ames and Naaman 2007;
Sigurbjörnsson and Van Zwol 2008; Nov and Ye 2010]. It would be relevant to survey
why people search images on social media platforms and what query terms they
actually use. Although some query log data of generic web image search have been
made publicly accessible [Hua et al. 2013], its social-media counterpart remains to
be established. Most of the existing works have instead investigated the technological
possibilities to automatically assign, refine, and enrich image tags. They mainly concentrated
on how to expand the set of tags provided by the uploader, by looking at tags
that others have associated with similar content, thereby expecting to include tags suited to
the retriever’s motivations. Consequently, images will become findable and potentially
appreciated by a wider range of audiences beyond the relatively small social circle of 
the image uploader. We categorize these existing works into three different main tasks 
and structure our survey along these tasks: 

— Tag Assignment. Given an unlabeled image, tag assignment strives to assign a
  [...] [... et al. 2009; Verbeek et al. 2010; Tang et al. 2011].
— Tag Refinement. Given an image associated with some initial tags, tag refinement
  aims to remove irrelevant tags from the initial tag list and enrich it with novel, yet
  relevant, tags [Liu et al. 2010; Wu et al. 2013; Znaidia et al. 2013; Lin et al. 2013;
  Feng et al. 20...].
— Tag Retrieval. Given a tag and a collection of images labeled with the tag (and possibly
  other tags), the goal of tag retrieval is to retrieve images relevant with respect
  to the tag of interest [Li et al. 2009b; Duan et al. 2011; Sun et al. 2011; Gao et al. ...].

Other related tasks such as tag filtering [Zhu et al. 2010; Liu et al. 2011b; Zhu et al.
2012] and tag suggestion [Sigurbjörnsson and Van Zwol 2008; Li et al. 2009b; Wu et al.
2009] have also been studied. We view them as variants of tag refinement.


As a common factor in all the works for tag assignment, refinement and retrieval, 
we reckon that the way in which the tag set expansion is performed relies on the key 
functionality of tag relevance, i.e., estimating the relevance of a tag with respect to the 
visual content of a given image and its social context. 


1.2. Scope, Aims, and Organization 

We survey papers that learn tag relevance from images tagged in social contexts. While
it would have been important to consider the complementarity of tags, only a few methods
have considered multi-tag retrieval [Li et al. 2012; Nie et al. 2012; Borth et al.
2013]. Hence, we focus on methods that implement the unique-tag relevance model.
We do not cover traditional image classification that is grounded on carefully labeled
data. For a state-of-the-art overview in that direction, we refer the interested reader
to [Everingham et al. 2015; Russakovsky et al. 2015]. Nonetheless, one may question
the necessity of using socially tagged examples as training data, given that a number
of labeled resources are already publicly accessible. An exemplar of such resources is
ImageNet [Deng et al. 2009], providing crowd-sourced positive examples for over 20k
classes. Since ImageNet employs several web image search engines to obtain candidate
images, its positive examples tend to be biased by the search results. As observed
by [Vreeswijk et al. 2012], the positive set of vehicles mainly consists of cars and buses,
although vehicles can also be trucks, watercraft and aircraft. Moreover, controversial images
are discarded upon vote disagreement during the crowdsourcing. All this reduces
diversity in visual appearance. We empirically show in Section 5.4 the advantage of socially
tagged examples against ImageNet for tag relevance learning.

Reviews on social tagging exist. The work by Gupta et al. discusses papers on why 
people tag, what influences the choice of tags, and how to model the tagging process, 
but its discussion on content-based image tagging is limited [Gupta et al. 2010]. The
focus of [Jabeen et al. 2016] is on papers about adding semantics to tags by exploiting
varied knowledge sources such as Wikipedia, DBpedia, and WordNet. Again, it leaves 
the visual information untouched. 


Several reviews that consider socially tagged images have appeared recently. In [Liu
et al. 2011], technical achievements in content-based tag processing for social images
are briefly surveyed. Sawant et al. [Sawant et al. 2011], Wang et al. [Wang et al. 2012]
and Mei et al. [Mei et al. 2014] present extended reviews of particular aspects, i.e.,
collaborative media annotation, assistive tagging, and visual search re-ranking,
respectively. In [Sawant et al. 2011], papers that propose collaborative image labeling
games and tagging in social media networks are reviewed. In [Wang et al. 2012] the
authors survey papers where computers assist humans in tagging either by organizing
data for manual labelling, improving quality of human-provided tags or recommending
tags for manual selection, instead of applying purely automatic tagging. In [Mei et al.
2014] the authors review techniques that aim to improve initial search results, typically
returned by a text-based visual search engine, by visual search re-ranking. These
reviews offer summaries of the methods and interesting insights on particular aspects of
the domain, without giving an experimental comparison between the varied methods.

We notice efforts in empirical evaluations of social media annotation and retrieval
[Sun et al. 2011; Uricchio et al. 2013; Ballan et al. 2015]. In [Sun et al. 2011], the authors
analyze different dimensions to compute the relevance score between a tagged
image and a tag. They evaluate varied combinations of these dimensions for tag-based
image retrieval on NUS-WIDE, a leading benchmark set for social image retrieval
[Chua et al. 2009]. However, their evaluation focuses only on tag-based image ranking
features, without comparing content-based methods. Moreover, tag assignment and
refinement are not covered. In [Uricchio et al. 2013; Ballan et al. 2015], the authors
compared three algorithms for tag refinement on NUS-WIDE and MIR Flickr, a
popular benchmark set for tag assignment and refinement [Huiskes et al. 2010]. However,
the two reviews lack a thorough comparison between different methods under the
umbrella of a common experimental protocol. Moreover, they fail to assess the high-level
connection between image tag assignment, refinement, and retrieval.

The aims of this survey are twofold. First, we organize the rich literature in a taxonomy
to highlight the ingredients of the main works in the literature and recognize
their advantages and limitations. In particular, we structure our survey along the line
of understanding how a specific method constructs the underlying tag relevance function.
Witnessing the absence of a thorough empirical comparison in the literature, our
second goal is to establish a common experimental protocol and subsequently apply it in
the evaluation of key methods. Our proposed protocol contains training data of varied
scales extracted from social frameworks. This permits evaluating the methods under
analysis with data that reflect the specificity of the social domain. We have made the
data and source code public¹ so that new proposals for tag assignment, tag refinement,
and tag retrieval can be evaluated rigorously and easily. Taken together, these efforts
should provide an overview of the field’s past and foster progress for the near future.

The rest of the survey is organized as follows. Section 2 introduces a taxonomy to
structure the literature on tag relevance learning. Section 3 proposes a new experimental
protocol for evaluating the three tasks. A selected set of eleven representative
works, described in Section 4, is compared extensively using this protocol, with results
and analysis provided in Section 5. We provide concluding remarks and our vision
about future directions in Section 6.


¹ https://github.com/li-xirong/jingwei
2. TAXONOMY AND REVIEW
2.1. Foundations 

Our key observation is that the essential component, which measures the relevance between
a given image and a specific tag, stands at the heart of the three tasks. In order
to describe this component in a more formal way, we first introduce some notation.

We use $x$, $t$, and $u$ to represent three basic elements in social images, namely image,
tag, and user. An image $x$ is shared on social media by its user $u$. A user $u$ can choose a
specific tag $t$ to label $x$. By sharing and tagging images, a set of users $U$ contribute a set
of $n$ socially tagged images $X$, wherein $X_t$ denotes the set of images tagged with $t$. Tags
used to describe the image set form a vocabulary of $m$ tags $V$. The relationship between
images and tags can be represented by an image-tag association matrix $D \in \{0, 1\}^{n \times m}$,
where $D_{ij} = 1$ means the $i$-th image is labeled with the $j$-th tag, and 0 otherwise.
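
As a minimal sketch of this notation, the following Python builds $D$ from per-image tag lists and recovers $X_t$ from it. The toy data and the helper names `build_association_matrix` and `images_tagged_with` are illustrative assumptions, not part of the survey.

```python
def build_association_matrix(image_tags, vocabulary):
    """Return D as n rows of m entries in {0, 1}; D[i][j] = 1 if image i has tag j."""
    index = {tag: j for j, tag in enumerate(vocabulary)}
    matrix = []
    for tags in image_tags:
        row = [0] * len(vocabulary)
        for tag in tags:
            if tag in index:
                row[index[tag]] = 1
        matrix.append(row)
    return matrix


def images_tagged_with(matrix, vocabulary, tag):
    """Return the indices of the images in X_t, i.e. those labeled with the tag."""
    j = vocabulary.index(tag)
    return [i for i, row in enumerate(matrix) if row[j] == 1]


# Hypothetical toy collection: three images with user-provided tags.
image_tags = [["dog", "beach"], ["beach", "sunset"], ["dog"]]
vocabulary = sorted({t for tags in image_tags for t in tags})
D = build_association_matrix(image_tags, vocabulary)
print(vocabulary)
print(D)
print(images_tagged_with(D, vocabulary, "dog"))
```

A dense list of lists is used only for readability; for collections of realistic size a sparse representation would be the more natural choice, since each image carries only a handful of the $m$ tags.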

Given an image and a tag, we introduce a real-valued function that computes the
relevance between $x$ and $t$ based on the visual content and an optional set of user
information $\Theta$ associated with the image:

$$f_\Phi(x, t; \Theta)$$

We use $\Theta$ in a broad sense, making it refer to any type of social context provided
by or referring to the user like associated tags, where and when the image was taken,
personal profile, and contacts. The subscript $\Phi$ specifies how the tag relevance function
is constructed.

Having $f_\Phi(x, t; \Theta)$ defined, we can easily interpret each of the three tasks. Assignment
and refinement can be done by sorting $V$ in descending order by $f_\Phi(x, t; \Theta)$, while
retrieval can be achieved by sorting the labeled image set $X_t$ in descending order in
terms of $f_\Phi(x, t; \Theta)$. Note that this formalization does not necessarily imply that the
same implementation of tag relevance is applied for all the three tasks. For example,
for retrieval relevance is intended to obtain image ranking [Li 2016], while tag ranking
for each single image is the goal of assignment [Wu et al. 2009] and refinement [Qian
et al. 2014].
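
The reading of the three tasks as sorting problems can be sketched as follows. The function names, the lookup-table relevance function, and the toy scores are hypothetical; any implementation of the tag relevance function with the same call signature could be plugged in.

```python
def rank_tags(relevance, image, vocabulary, context=None):
    """Sort the vocabulary V in descending order of relevance to the image."""
    return sorted(vocabulary, key=lambda t: relevance(image, t, context), reverse=True)


def assign_tags(relevance, image, vocabulary, top_k):
    """Tag assignment: keep the top_k tags for an unlabeled image."""
    return rank_tags(relevance, image, vocabulary)[:top_k]


def refine_tags(relevance, image, vocabulary, initial_tags, top_k):
    """Tag refinement: re-rank V, passing the initial tags as user information."""
    return rank_tags(relevance, image, vocabulary, context=initial_tags)[:top_k]


def retrieve_images(relevance, tag, tagged_images, context=None):
    """Tag retrieval: sort the images labeled with the tag in descending relevance."""
    return sorted(tagged_images, key=lambda x: relevance(x, tag, context), reverse=True)


# Hypothetical relevance scores standing in for a learned function.
TOY_SCORES = {
    ("img1", "dog"): 0.9, ("img1", "beach"): 0.6, ("img1", "car"): 0.1,
    ("img2", "dog"): 0.3, ("img3", "dog"): 0.7,
}


def toy_relevance(image, tag, context=None):
    # The context (user information) is ignored by this toy function.
    return TOY_SCORES.get((image, tag), 0.0)


vocabulary = ["beach", "car", "dog"]
print(assign_tags(toy_relevance, "img1", vocabulary, top_k=2))
print(retrieve_images(toy_relevance, "dog", ["img1", "img2", "img3"]))
```

Passing the relevance function as an argument mirrors the remark above that the three tasks need not share one implementation of tag relevance.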

Fig. 1 presents a unified framework, illustrating the main data flow of varied approaches
to tag relevance learning. Compared to traditional methods that rely on
expert-labeled examples, a novel characteristic of a social media based method is its
capability to learn from socially tagged examples with unreliable and personalized annotations.
Such a training media is marked as $S$ in the framework and includes tags,
images or user-related information. Optionally, in order to obtain a refined training
media $\hat{S}$, one might consider designing a filter to remove unwanted data. In addition,
prior information such as tag statistics, tag correlations, and image affinities in the
training media is independent of a specific image-tag pair. It can be precomputed
for the sake of efficiency. As the filter and the precomputation appear to be a choice of
implementation, they are positioned as auxiliary components in Fig. 1.

A number of implementations of the relevance function have been proposed that
utilize different modes to expand the tag set by learning within the social context.
They may exploit different media, such as tags only, tags and related image content,
or tags, image content and user-related information. Depending on how $f_\Phi(x, t; \Theta)$ is
composed internally, we propose a taxonomy which organizes existing works along two
dimensions, namely media and learning. The media dimension characterizes what essential
information $f_\Phi(x, t; \Theta)$ exploits, while the learning dimension depicts how such
information is exploited. Table I presents a list of the most significant contributions
organized along these two dimensions. For a specific work, while Fig. 1 helps illustrate
the main data flow of the method, its position in the two-dimensional taxonomy
is pinpointed via Table I. We believe such a context provides a good starting point
for an in-depth understanding of the work. We explore the taxonomy along the media
dimension in Section 2.2 and the learning dimension in Section 2.3. Auxiliary components
are addressed in Section 2.4. A comparative evaluation of a few representative
methods is presented in Section 4.

Fig. 1. Unified framework of tag relevance learning for image tag assignment, refinement and
retrieval. We follow the input data as it flows through the process of learning the tag relevance function
$f_\Phi(x, t; \Theta)$ to higher-level tasks. Dashed lines indicate optional data flow. The framework jointly classifies
existing works on Assignment, Refinement and Retrieval while at the same time determining their main components.


2.2. Media for tag relevance 

Different sources of information may play a role in determining the relevance between
an image and a social tag. For instance, the position of a tag appearing in the tag
list might reflect a user’s tagging priority to some extent [Sun et al. 2011]. Knowing
what other tags are assigned to the image [Zhu et al. 2012] or what other users label
about similar images [Li et al. 2009b; Kennedy et al. 2009] can also be helpful for
judging whether the tag under examination is appropriate or not. Depending on what
modalities in $S$ are utilized, we divide existing works into the following three groups: 1)
tag based, 2) tag + image based and 3) tag + image + user information based, ordered
in light of the amount of information they utilize. Table I shows this classification for
several papers that appeared in the literature on the subject.

2.2.1. Tag based. These methods build $f_\Phi(x, t; \Theta)$ purely based on tag information. The
basic idea is to assign higher relevance scores to tags that are semantically close to the
majority of the tags associated with the test image. To that end, in [Sigurbjörnsson and
Van Zwol 2008; Zhu et al. 2012] relevant tags are suggested based on tag co-occurrence
statistics mined from large-scale collections, while topic modeling is employed in [Xu
et al. 2009]. As the tag based methods presume that the test image has been labeled
with some initial tags, i.e. the initial tags are taken as the user information $\Theta$, they
are inapplicable for tag assignment.
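
A tag-based relevance score driven by co-occurrence statistics might look like the sketch below. It is an illustration under stated assumptions rather than the method of any cited work: the Jaccard coefficient is one possible normalization of co-occurrence counts, averaging over the initial tags is an assumed aggregation, and the tag lists are toy data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_stats(tag_lists):
    """Count, over a tagged collection, how often each tag and each tag pair occurs."""
    tag_count = Counter()
    pair_count = Counter()
    for tags in tag_lists:
        unique = sorted(set(tags))
        tag_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return tag_count, pair_count


def jaccard(a, b, tag_count, pair_count):
    """Jaccard coefficient of two distinct tags: images with both / images with either."""
    both = pair_count[tuple(sorted((a, b)))]
    either = tag_count[a] + tag_count[b] - both
    return both / either if either else 0.0


def tag_based_relevance(candidate, initial_tags, tag_count, pair_count):
    """Average co-occurrence of a candidate tag with the image's initial tags."""
    others = [t for t in initial_tags if t != candidate]
    if not others:
        return 0.0  # no initial tags, so a tag-only method has nothing to work with
    return sum(jaccard(candidate, t, tag_count, pair_count) for t in others) / len(others)


# Hypothetical tagged collection used to mine the statistics.
tag_lists = [["dog", "beach"], ["dog", "puppy"], ["dog", "puppy", "park"], ["beach", "sunset"]]
tag_count, pair_count = cooccurrence_stats(tag_lists)
initial_tags = ["dog", "park"]
for candidate in ["puppy", "sunset"]:
    print(candidate, tag_based_relevance(candidate, initial_tags, tag_count, pair_count))
```

The early return for an empty list of initial tags makes the limitation noted above concrete: without initial tags the score carries no information, so the approach cannot serve tag assignment.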


2.2.2. Tag + Image based. Works in this group develop $f_\Phi(x, t; \Theta)$ on the basis of visual
information and associated tags. The main rationale behind them is visual consistency,
i.e. visually similar images shall be labeled with similar tags. Implementations of this
intuition fall into three lines of work. One, leverage images visually close to the

ACM Computing Surveys, Vol. X, No. X, Article X, Publication date: March 2016.

Socializing the Semantic Gap

test image [Li et al. 2009b; Li et al. 2010; Verbeek et al. 2010; Ma et al. 2010; Wu et al.
2011; Feng et al. 2012]. Two, exploit relationships between images labeled with the
same tag [Liu et al. 2009; Richter et al. 2012; Liu et al. 2011b; Kuo et al. 2012; Gao
et al. 2013]. Three, learn visual classifiers from socially tagged examples [Wang et al.
2009; Chen et al. 2012; Li and Snoek 2013; Yang et al. 2014]. By propagating tags
based on the visual evidence, the above works exploit the image modality and the tag
modality in a sequential way. By contrast, there are works that concurrently exploit
the two modalities. This can be approached by generating a common latent space upon
the image-tag association [Srivastava and Salakhutdinov 2014; Niu et al. 2014; Duan
et al. 2014], so that a cross-media similarity can be computed between images and tags
[Zhuang and Hoi 2011; Qi et al. 2012; Liu et al. 2013]. In [Pereira et al. 2014], the latent
space is constructed by Canonical Correlation Analysis, finding two matrices which
separately project feature vectors of image and tag into the same subspace. In [Ma
et al. 2010], a random walk model is used on a unified graph composed from the fusion
of an image similarity graph with an image-tag connection graph. In [Wu et al. 2013;
Xu et al. 2014; Zhu et al. 2010], predefined image similarity and tag similarity are
used as two constraint terms to enforce that similarities induced from the recovered
image-tag association matrix will be consistent with the two predefined similarities.

Although late fusion has been actively studied for multimedia data analysis [Atrey
et al. 2010], improving tag relevance estimation by late fusion is not much explored.
There are some efforts in that direction, among which interesting performance has
been reported in [Qian et al. 2014] and more recently in [Li 2016].


2.2.3. Tag + Image + User-related information based. In addition to tags and images, this
group of works exploits user information, motivated from varied perspectives. User information
ranges from the simplest user identities [Li et al. 2009b], tagging preferences
[Sawant et al. 2010] to user reliability [Ginsca et al. 2014] and to image group
memberships [Johnson et al. 2015]. With the hypothesis that a specific tag chosen by
many users to label visually similar images is more likely to be relevant with respect
to the visual content, [Li et al. 2009b] utilizes user identity to ensure that learning
examples come from distinct users. A similar idea is reported in [Kennedy et al. 2009],
finding visually similar image pairs with matching tags from different users. [Ginsca
et al. 2014] improves image retrieval by favoring images uploaded by users with
good credibility estimates. The reliability of an image uploader is inferred by counting
matches between the user-provided tags and machine tags predicted by visual concept
detectors. In [Sawant et al. 2010; Li et al. 2011b], personal tagging preference is considered
in the form of tag statistics computed from images a user has uploaded in the
past. These past images are used in [Liu et al. 2014] to learn a user-specific embedding
space. In [Sang et al. 2012a], user affinities, measured in terms of the number of common
groups users are sharing, are considered in a tensor analysis framework. Similarly,
tensor based low-rank data reconstruction is employed in [Qian et al. 2015] to discover
latent associations between users, images, and tags. Photo timestamps are exploited
for time-sensitive image retrieval [Kim and Xing 2013], where the connection between
image occurrence and various temporal factors is modeled. In [McParlane et al. 2013b],
time-constrained tag co-occurrence statistics are considered to refine the output of visual
classifiers for tag assignment. In their follow-up work [McParlane et al. 2013a],
location-constrained tag co-occurrence computed from images taken in a specific continent
is further included. User interactions in social networks are exploited in [Sawant
et al. 2010], computing local interaction networks from the comments left by other
users. In [McAuley and Leskovec 2012; Johnson et al. 2015], social-network metadata
such as image group memberships or contacts of users is employed to resolve ambiguity
in visual appearance.

Comparing the three groups, tag + image appears to be the mainstream, as evidenced
by the imbalanced distribution in Table I. Intuitively, using more media from $S$
would typically improve tag relevance estimation. We attribute the imbalance among
the groups, in particular the relatively few works in the third group, to the following
two reasons. First, no publicly available dataset with expert annotations was built to
gather representative and adequate user information, e.g. MIRFlickr has nearly 10k
users for 25k images, while in NUS-WIDE only 6% of the users have at least 15 images.
As a consequence, current works that leverage user information are forced to
use a minimal subset to alleviate sample insufficiency [Sang et al. 2012a; Sang et al.
2012b] or homemade collections with social tags as ground truth instead of benchmark
sets [Sawant et al. 2010; Li et al. 2011b]. Second, adding more media often results in
a substantial increase in terms of both computation and memory, e.g. the cubic complexity
for tensor factorization in [Sang et al. 2012a]. As a trade-off, one has to use $S$
of a much smaller scale. The dilemma is whether one should use large data with less
media or more media but less data.

It is worth noting that the above groups are not exclusive. The output of some methods
can be used as a refined input of some other methods. In particular, we observe a
frequent usage of tag-based methods by others for their computational efficiency. For
instance, tag relevance measured in terms of tag similarity is used in [Zhuang and
Hoi 2011; Gao et al. 2013; Li and Snoek 2013] before applying more advanced analysis,
while nearest neighbor tag propagation is a pre-process used in [Zhu et al. 2010].
The number of tags per image is embedded into image retrieval functions in [Liu et al.
2009; Xu et al. 2009; Zhuang and Hoi 2011; Chen et al. 2012].


Given the varied sources of information one could leverage, the subsequent question 
is how the information is exactly utilized, which will be made clear next. 


2.3. Learning for tag relevance 

This section presents the second dimension of the taxonomy, elaborating on various
algorithms that implement the computation of tag relevance. Ideally, given the large-scale
nature of social images, a desirable algorithm shall maintain good scalability
as the data grows. The algorithm shall also provide a flexible mechanism to effectively
integrate various types of information including tags, images, social metadata, etc.,
while at the same time being robust when not all the information is available. In
what follows we review existing algorithms on their efforts to meet the requirements.

Depending on whether the tag relevance learning process is transductive, i.e., producing
tag relevance scores without distinction as training and testing, we divide existing
works into transduction-based and induction-based. Since the latter produces rules
or models that are directly applicable to a novel instance [Michalski 1993], it has better
scalability for large-scale data compared to its transductive counterpart. Depending
on whether an explicit model, be it discriminative or generative, is built, a further
division for the induction-based methods can be made: instance-based algorithms
and model-based algorithms. Consequently, we divide existing works into the following
three exclusive groups: 1) instance-based, 2) model-based, and 3) transduction-based.


2.3.1. Instance-based. This class of methods does not perform explicit generalization
but, instead, compares new test images with training instances. It is called instance-based
because it constructs hypotheses directly from the training instances themselves.
These methods are nonparametric and the complexity of the learned hypotheses
grows as the amount of training data increases. The neighbor voting algorithm
[Li et al. 2009b] and its variants [Kennedy et al. 2009; Li et al. 2010; Truong et al.
2012; Lee et al. 2013; Zhu et al. 2014] estimate the relevance of a tag $t$ with respect
to an image $x$ by counting the occurrence of $t$ in annotations of the visual neighbors

Table I. The taxonomy of methods for tag relevance learning, organized along the Media and Learning dimensions of Fig. 1.
Methods for which this survey provides an experimental evaluation are indicated in bold font.

Media: tag
  Instance-based: [Sigurbjörnsson and Van Zwol 2008], [Zhu et al. 2012]
  Model-based: [Xu et al. 2009]
  Transduction-based: -

Media: tag + image
  Instance-based: [Liu et al. 2009], [Makadia et al. 2010], [Tang et al. 2011], [Wu et al. 2011],
    [Yang et al. 2011], [Truong et al. 2012], [Qi et al. 2012], [Lin et al. 2013], [Lee et al. 2013],
    [Uricchio et al. 2013], [Zhu et al. 2014], [Ballan et al. 2014], [Pereira et al. 2014]
  Model-based: [Wu et al. 2009], [Guillaumin et al. 2009], [Verbeek et al. 2010], [Liu et al. 2010],
    [Ma et al. 2010], [Liu et al. 2011b], [Duan et al. 2011], [Feng et al. 2012],
    [Srivastava and Salakhutdinov 2014], [Chen et al. 2012], [Lan and Mori 2013],
    [Li and Snoek 2013], [Li et al. 2013], [Wang et al. 2014], [Niu et al. 2014]
  Transduction-based: [Zhu et al. 2010], [Wang et al. 2010], [Li et al. 2010], [Zhuang and Hoi 2011],
    [Richter et al. 2012], [Kuo et al. 2012], [Liu et al. 2013], [Gao et al. 2011], [Wu et al. 2013],
    [Yang et al. 2014], [Feng et al. 2014], [Xu et al. 2014]

Media: tag + image + user
  Instance-based: [Li et al. 2009b], [Kennedy et al. 2009], [Li et al. 2010], [Znaidia et al. 2013],
    [Liu et al. 2014]
  Model-based: [Sawant et al. 2010], [Li et al. 2011b], [McAuley and Leskovec 2012],
    [Kim and Xing 2013], [McParlane et al. 2013a], [Ginsca et al. 2014], [Johnson et al. 2015]
  Transduction-based: [Sang et al. 2012a], [Sang et al. 2012b], [Qian et al. 2015]


of x. The visual neighborhood is created using features obtained from early-fusion of 
global fe atures | Li et al. 2009b|, distance me tric learning to combine local and global 
features [Verbeek et al. 2010; Wu et al. 2011], cross-modal learning of tags and image
features [Qi et al. 2012; Ballan et al. 2014; Pereira et al. 2014], and fusion of multiple
single-feature learners [Li et al. 2010; Li 2016]. While the standard neighbor voting
algorithm [Li et al. 2009b] simply lets the neighbors vote equally, efforts have been
made to (heuristically) weight neighbors in terms of their importance. For instance,
in [Truong et al. 2012; Lee et al. 2013] the visual similarity is used as the weights.
As an alternative to such a heuristic strategy, [Zhu et al. 2014] models the relationships
among the neighbors by constructing a directed voting graph, wherein there is a
directed edge from image $x_i$ to image $x_j$ if $x_i$ is in the k nearest neighbors of $x_j$.
Subsequently, an adaptive random walk is conducted over the voting graph to estimate the
tag relevance. However, the performance gain obtained by these weighting strategies
appears to be limited [Zhu et al. 2014]. The kernel density estimation technique used
in [Liu et al. 2009] can be viewed as another form of weighted voting, but the votes
come from images labeled with t instead of the visual neighbors. [Yang et al. 2011]
further considers the distance of the test image to images not labeled with t. In order to
eliminate semantically unrelated samples in the neighborhood, sparse reconstruction
from a k-nearest neighborhood is used in [Tang et al. 2009; Tang et al. 2011]. In [Lin
et al. 2013], with the intention of recovering missing tags by matrix reconstruction, the
image and tag modalities are exploited separately and in parallel, each producing a new
candidate image-tag association matrix. Then, the two resultant tag relevance scores
are linearly combined to produce the final tag relevance scores. To address the
incompleteness of tags associated with the visual neighbors, [Znaidia et al. 2013] proposes
to enrich these tags by exploiting tag co-occurrence prior to neighbor voting.
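
The difference between equal voting and similarity-weighted voting can be illustrated with a minimal sketch. It is not the implementation of any cited work: the function name, the (similarity, tags) input layout, and the use of the raw similarity as the vote weight are assumptions made for illustration.

```python
from collections import defaultdict


def neighbor_votes(neighbors, weighted=False):
    """Accumulate tag votes from the visual neighbors of a test image.

    neighbors: list of (visual_similarity, tags) pairs, one per neighbor
               (hypothetical input layout).
    weighted:  if False every neighbor casts a vote of 1; if True the vote
               equals the neighbor's visual similarity (assumed weighting).
    """
    votes = defaultdict(float)
    for similarity, tags in neighbors:
        for tag in set(tags):  # a neighbor votes at most once per tag
            votes[tag] += similarity if weighted else 1.0
    return dict(votes)


neighbors = [(0.9, ["dog", "park"]), (0.5, ["dog"]), (0.1, ["cat"])]
equal = neighbor_votes(neighbors)
by_similarity = neighbor_votes(neighbors, weighted=True)
```

With equal voting 'park' and 'cat' tie at one vote each, whereas the weighted variant separates them because the neighbor tagged with 'park' is visually much closer.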

2.3.2. Model-based. This class of tag relevance learning algorithms builds on
parameterized models learned from the training media. Notice that the
models can be tag-specific or holistic for all tags. As an example of holistic modeling,
a topic model approach is presented in [Wang et al. 2014] for tag refinement, where a
hidden topic layer is introduced between images and tags. Consequently, the tag relevance
function is implemented as the dot product between the topic vector of the image
and the topic vector of the tag. In particular, the authors extend the Latent Dirichlet
Allocation model [Blei et al. 2003] to force images with similar visual content to have
similar topic distributions. According to their experiments [Wang et al. 2014], however,
the gain of such a regularization appears to be marginal compared to the standard Latent
Dirichlet Allocation model. [Li et al. 2013] first finds embedding vectors of training
images and tags using the image-tag association matrix of S. The embedding vector
of a test image is obtained by a convex combination of the embedding vectors of its
neighbors retrieved in the original visual feature space. Consequently, the relevance
score is computed in terms of the Euclidean distance between the embedding vectors
of the test image and the tag.
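
The embedding-based scoring just described can be sketched as follows. This is only an illustration under stated assumptions: the combination weights are taken as given, and the negated Euclidean distance is used as the relevance score so that closer tags score higher; the cited work may define both differently. All function names are hypothetical.

```python
import math


def convex_combination(vectors, weights):
    """Combine neighbor embeddings with non-negative weights that are
    normalized to sum to one (how weights are chosen is an assumption)."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and not all zero")
    dim = len(vectors[0])
    return [sum(w / total * v[d] for w, v in zip(weights, vectors))
            for d in range(dim)]


def embedding_relevance(image_vec, tag_vec):
    """Assumed scoring: negated Euclidean distance, so smaller distance
    means higher relevance."""
    return -math.dist(image_vec, tag_vec)


# Embed a test image from two neighbors, then score two candidate tags.
test_vec = convex_combination([[0.0, 0.0], [2.0, 2.0]], [1.0, 1.0])
near = embedding_relevance(test_vec, [1.0, 1.5])
far = embedding_relevance(test_vec, [4.0, 4.0])
```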

For tag-specific modeling, linear SVM classifiers trained on features augmented by
pre-trained classifiers of popular tags are used in [Chen et al. 2012] for tag retrieval.
Fast intersection kernel SVMs trained on selected relevant positive and negative
examples are used in [Li and Snoek 2013]. A bag-based image reranking framework is
introduced in [Duan et al. 2011], where pseudo-relevant images retrieved by tag matching
are partitioned into clusters using visual and textual features. Then, by treating
each cluster as a bag and images within the cluster as its instances, multiple instance
learning [Andrews et al. 2003] is employed to learn multiple-instance SVMs per tag.
Viewing the social tags of a test image as ground truth, a multi-modal tag suggestion
method based on both tags and visual correlation is introduced in [Wu et al. 2009].
Each modality is used to generate a ranking feature, and the tag relevance function is
a combination of these ranking features, with the combination weights learned online
by the RankBoost algorithm [Freund et al. 2003]. In [Guillaumin et al. 2009; Verbeek
et al. 2010], logistic regression models are built per tag to promote rare tags. In a
similar spirit to [Li and Snoek 2013], [Zhou et al. 2015] learns an ensemble of SVMs by
treating tagged images as positive training examples and untagged images as candidate
negative training examples. Using the ensemble to classify image regions generated
by automated image segmentation, the authors assign tags at the image level and
the region level simultaneously.

2.3.3. Transduction-based. This class of methods consists of procedures that evaluate
tag relevance for all image-tag pairs by minimizing a specific cost function. Given the
initial image-tag association matrix D, the output of the procedures is a new matrix
$\hat{D}$, the elements of which are taken as tag relevance scores. Due to this formulation,
there is neither an explicit form of the tag relevance function nor any distinction between
training and test sets [Joachims 1999]. If novel images are added to the initial set, the
cost function has to be minimized again.

The majority of transduction-based approaches are founded on matrix factorization
[Zhu et al. 2010; Sang et al. 2012a; Liu et al. 2013; Wu et al. 2013; Kalayeh et al. 2014;
Feng et al. 2014; Xu et al. 2014]. In [Zhuang and Hoi 2011] the objective function is a
linear combination of the difference between $\hat{D}$ and the matrix of image similarity, the
distortion between $\hat{D}$ and the matrix of tag similarity, and the difference between $\hat{D}$
and D. A stochastic coordinate descent optimization is applied to a randomly chosen
row of $\hat{D}$ per iteration. In [Zhu et al. 2010], considering the fact that D is corrupted
with noise caused by missing or over-personalized tags, robust principal component
analysis with Laplacian regularization is applied to recover $\hat{D}$ as a low-rank matrix.
In [Wu et al. 2013], $\hat{D}$ is regularized such that the image similarity induced from $\hat{D}$
is consistent with the image similarity computed in terms of low-level visual features,
and the tag similarity induced from $\hat{D}$ is consistent with the tag correlation score
computed in terms of tag co-occurrence. [Xu et al. 2014] proposes to re-weight the penalty
term of each image-tag pair by its relevance score, which is estimated by a linear
fusion of tag-based and content-based relevance scores. To incorporate the user element,
[Sang et al. 2012a] extends D to a three-way tensor with tag, image, and user as the
three ways. A core tensor and three matrices representing the three media, obtained
by Tucker decomposition [Tucker 1966], are multiplied to construct $\hat{D}$.

As an alternative approach, in [Feng et al. 2014] it is assumed that the tags of an
image are drawn independently from a fixed but unknown multinomial distribution.
Estimation of this distribution is implemented by maximum likelihood with low-rank
matrix recovery and Laplacian regularization, as in [Zhu et al. 2010].

Graph-based label propagation is another type of transduction-based method. In
[Richter et al. 2012; Wang et al. 2010; Kuo et al. 2012], the image-tag pairs are
represented as a graph in which each node corresponds to a specific image and the edges
are weighted according to a multi-modal similarity measure. Viewing the top ranked
examples in the initial search results as positive instances, tag refinement is
implemented as a semi-supervised labeling process by propagating labels from the positive
instances to the remaining examples using a random walk. While the edge weights are
fixed in the above works, [Gao et al. 2013] argues that fixing the weights could be
problematic, because tags found to be discriminative in the learning process should
adaptively contribute more to the edge weights. In that regard, the hypergraph learning
algorithm [Zhou et al. 2006] is exploited and weights are optimized by minimizing
a joint loss function which considers both the graph structure and the divergence
between the initial labels and the learned labels. In [Liu et al. 2011a], the hypergraph is
embedded into a lower-dimensional space by the hypergraph Laplacian.
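
To illustrate label propagation by a random walk with fixed edge weights, a minimal sketch is given below. The particular update rule (a walk that restarts at the initial labels with probability 1 - alpha), the row normalization, and all names and parameter values are assumptions chosen for illustration; the cited works differ in their exact formulations.

```python
def propagate_labels(weights, initial, alpha=0.8, iterations=50):
    """Propagate labels over an image graph by a random walk with restart.

    weights: square list-of-lists of non-negative edge weights (fixed).
    initial: initial label per node, e.g. 1.0 for top-ranked positives.
    alpha:   probability of following an edge rather than restarting
             (assumed value).
    """
    n = len(initial)
    # Row-normalize the weights so each row is a transition distribution.
    transition = []
    for row in weights:
        total = sum(row)
        transition.append([w / total if total > 0 else 0.0 for w in row])
    scores = list(initial)
    for _ in range(iterations):
        scores = [
            alpha * sum(transition[i][j] * scores[j] for j in range(n))
            + (1 - alpha) * initial[i]
            for i in range(n)
        ]
    return scores


# A chain 0 - 1 - 2 plus an isolated node 3; node 0 is the only positive.
graph = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
scores = propagate_labels(graph, [1.0, 0.0, 0.0, 0.0])
```

In this toy graph the score decays with the distance from the positive node, and the isolated node receives no label mass at all.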

Comparing the three groups of methods for learning tag relevance, an advantage
of instance-based methods over the other two groups is their flexibility to adapt to
previously unseen images and tags. They may simply add new training images into S
or remove outdated ones. The advantage, however, comes at a price: S has to be
maintained, a non-trivial task given the increasing amount of training data available.
Also, the computational complexity and memory footprint grow linearly with respect
to the size of S. In contrast, model-based methods could be faster, especially when
linear classifiers are used, as the training data is compactly represented by a fixed
number of models. However, as the imagery of a given tag may evolve, re-training is
required to keep the models up-to-date.

Different from instance-based and model-based learning where individual tags are
considered independently, transduction-based learning methods via matrix factorization
can favorably exploit inter-tag and inter-image relationships. However, their ability
to deal with the extremely large number of social images is a concern. For instance,
the use of Laplacian graphs results in a memory complexity of $O(|S|^2)$. The accelerated
proximal gradient algorithm used in [Zhu et al. 2010] requires Singular Value
Decomposition, which is known to be an expensive operation. The Tucker decomposition
used in [Sang et al. 2012a] has a cubic computational complexity with respect to
the number of training samples. We notice that some engineering tricks have been
considered in these works, which alleviate the scalability issue to some extent. In [Zhuang
and Hoi 2011], for instance, clustering is conducted in advance to divide S into much
smaller subsets, and the algorithm is applied to these subsets separately. By making
the Laplacian sparser through retaining only the k nearest neighbors [Zhu et al. 2010;
Sang et al. 2012a], the memory footprint can be reduced to $O(k \cdot |S|)$, at the cost of
performance degradation. Perhaps due to the scalability concern, works resorting to
matrix factorization tend to experiment with datasets of relatively small scale.

In summary, instance-based learning, in particular neighbor voting, is the first 
choice to try for its simplicity and decent performance. When the test tags are well de¬ 
fined (in the sense of relevant learning examples that can be collected automatically), 
model-based learning is more attractive. When the test images share similar social 
context, e.g., images shared by a group of specific interest, they tend to be on similar 
topics. In such a scenario, transduction-based learning that exploits the inter-image 
relationship is more suited. 


2.4. Auxiliary components 

The Filter and the Precompute component are auxiliary components that may sustain 
and improve tag relevance learning. 

Filter. As social tags are known to be subjective and overly personalized, removing
personalized tags appears to be a natural and simple way to improve the tagging quality.
This is usually the first step performed in the framework for tag relevance learning.
Although there is no gold standard for determining which tags are personalized,
a popular strategy is to exclude tags which cannot be found in the WordNet ontology
[Zhu et al. 2010; Li et al. 2011b; Chen et al. 2012; Zhu et al. 2012] or a Wikipedia
thesaurus [Liu et al. 2009]. Tags with rare occurrence, say appearing less than 50 times,
are discarded in [Verbeek et al. 2010; Zhu et al. 2010]. For methods that directly work
on the image-tag association matrix [Zhu et al. 2010; Sang et al. 2012a; Wu et al. 2013;
Lin et al. 2013], reducing the size of the vocabulary in terms of tag occurrence is an
important prerequisite to keep the matrix at a manageable scale. Observing that images
tagged in a batch manner are often near duplicates and of low tagging quality,
batch-tagged images are excluded in [Li et al. 2012]. Since relevant tags may be missing
from user annotations, the negative tags that are semantically similar to or co-occurring
with positive ones are discarded in [Sang et al. 2012a]. As the above strategies do not
take the visual content into account, they cannot handle situations where an image is
incorrectly labeled with a valid and frequently used tag, say ‘dog’. In [Li et al. 2009a],
tag relevance scores are assigned to each image in S by running the neighbor voting
algorithm [Li et al. 2009b], while in [Li and Snoek 2013], the semantic field algorithm
[Zhu et al. 2012] is further added to select relevant training examples. In [Qian et al.
2015], the annotation of the training media is enriched by a random walk.


Precompute. The precompute component is responsible for the generation of the prior
information that is jointly used with the refined training media S in learning. For
instance, global statistics and external resources can be used to synthesize new prior
knowledge useful in learning. The prior information commonly used is tag statistics in
S, including tag occurrence and tag co-occurrence. Tag occurrence is used in [Li et al.
2009b] as a penalty to suppress overly frequent tags. Measuring the semantic similarity
between two tags is important for tag relevance learning algorithms that exploit tag
correlations. While linguistic metrics such as those derived from WordNet were used before
the proliferation of social media [Jin et al. 2005; Wang et al. 2006], they do not
directly reflect how people tag images. For instance, tag ‘sunset’ and tag ‘sea’ are weakly
related according to the WordNet ontology, but they often appear together in social
tagging as many of the sunset photos are shot at the seaside. Therefore, similarity
measures that are based on tag statistics computed from many socially tagged images
are in dominant use. Sigurbjornsson and van Zwol utilized the Jaccard coefficient
and a conditional tag probability in their tag suggestion system [Sigurbjornsson and
Van Zwol 2008], while Liu et al. used normalized tag co-occurrence [Liu et al. 2013]. To
better capture the visual relationship between two tags, Wu et al. proposed the Flickr
distance [Wu et al. 2008]. The authors represent each tag by a visual language model,
trained on bag of visual words features of images labeled with this tag. The Flickr
distance between two tags is computed as the Jensen-Shannon divergence between
the corresponding models. Later, Jiang et al. introduced the Flickr context similarity,
which also captures the visual relationship between two tags, but without the need
for expensive visual modeling [Jiang et al. 2009]. The trick is to compute the Normalized
Google Distance [Cilibrasi and Vitanyi 2007] between two tags, but with tag
statistics acquired from Flickr image collections instead of Google indexed web pages.


similarity in the literature [Liu et al. 2009; Zhu et al. 2010; Wang et al. 2010; Zhuang
and Hoi 2011; Zhu et al. 2012; Gao et al. 2013; Li and Snoek 2013; Qian et al. 2014].
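
Two of the co-occurrence based measures mentioned above, the Jaccard coefficient and the Normalized Google Distance computed from tag statistics, can be sketched in a few lines. The function names and the toy counts are hypothetical. The sketch returns the distance itself; how it is turned into the Flickr context similarity is left out because that mapping is not specified here.

```python
import math


def jaccard(images_a, images_b):
    """Jaccard coefficient between the sets of images labeled with two tags."""
    union = images_a | images_b
    return len(images_a & images_b) / len(union) if union else 0.0


def normalized_distance(count_a, count_b, count_ab, total):
    """Normalized Google Distance computed from tag statistics.

    count_a, count_b: number of images labeled with each tag.
    count_ab:         number of images labeled with both tags.
    total:            collection size; assumed larger than both counts.
    """
    if count_ab == 0:
        return math.inf  # tags that never co-occur are maximally distant
    log_a, log_b = math.log(count_a), math.log(count_b)
    return ((max(log_a, log_b) - math.log(count_ab))
            / (math.log(total) - min(log_a, log_b)))


# Toy statistics: 'sunset' on images 1-4, 'sea' on images 3-5, 10 images total.
sunset, sea = {1, 2, 3, 4}, {3, 4, 5}
j = jaccard(sunset, sea)
d = normalized_distance(len(sunset), len(sea), len(sunset & sea), total=10)
```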


3. A NEW EXPERIMENTAL PROTOCOL 

In spite of the expanding literature, there is a lack of consensus on the performance of
the individual methods. This is largely due to the fact that existing works either use
homemade data, see [Liu et al. 2009; Wang et al. 2010; Chen et al. 2012; Gao et al.
2013], which are not publicly accessible, or use selected subsets of benchmark data,
e.g., as in [Zhu et al. 2010; Sang et al. 2012a; Feng et al. 2014]. As a consequence, the
performance scores reported in the literature are not comparable across the papers.

Benchmark data with manually verified labels is crucial for an objective evaluation.
As Flickr has been well recognized as a profound manifestation of social image tagging,
Flickr images act as a main source for benchmark construction. MIRFlickr from
Leiden University [Huiskes et al. 2010] and NUS-WIDE from the National University
of Singapore [Chua et al. 2009] are the two most popular Flickr-based benchmark sets
for social image tagging and retrieval, as demonstrated by the number of citations. When
using the benchmarks, one typically follows a single-set protocol, that is, learning
the underlying tag relevance function from the training part of a chosen benchmark
set, and evaluating it on the test part. Such a protocol is inadequate given the dynamic
nature of social media, which could easily make an existing benchmark set outdated.
For any method targeting social images, a cross-set evaluation is necessary to test
its generalization ability, which is however overlooked in the literature.

Another desirable property is the capability to learn from the increasing amounts of 
socially tagged images. Since existing works mostly use training data of a fixed scale, 
this property has not been well evaluated. 

Following these considerations, we present a new experimental protocol, wherein 
training and test data from distinct research groups are chosen for evaluating a num¬ 
ber of representative works in the cross-set scenario. Training sets with their size 
ranging from 10k to one million images are constructed to evaluate methods of varied 
complexity. To the best of our knowledge, such a comparison between many methods 
on varied scale datasets with a common experimental setup has not been conducted 
before. For the sake of experimental reproducibility, all data and code are available
online.

3.1. Datasets 

We describe the training media S and the test media X as follows, with basic data
characteristics and their usage summarized in Table II.

Table II. Our proposed experimental protocol instantiates the Media and Tasks dimensions of Fig. 1 with three
training sets and three test sets for tag assignment, refinement and retrieval. Note that the training sets are
socially tagged; they have no ground truth available for any tag.

                                  |          Media characteristics           |              Tasks
Media                               # images     # tags   # users   # test tags   assignment   refinement   retrieval
Training media S:
  Train10k                            10,000     41,253     9,249             -          yes          yes         yes
  Train100k                          100,000    214,666    68,215             -          yes          yes         yes
  Train1m [Li et al. 2012]         1,198,818  1,127,139   347,369             -          yes          yes         yes
Test media X:
  MIRFlickr [Huiskes et al. 2010]     25,000     67,389     9,862            14          yes          yes           -
  Flickr51 [Wang et al. 2010]         81,541     66,900    20,886            51            -            -         yes
  NUS-WIDE [Chua et al. 2009]        259,233    355,913    51,645            81          yes          yes         yes

Training media S. We use a set of 1.2 million Flickr images collected by the University
of Amsterdam [Li et al. 2012], by using over 25,000 nouns in WordNet as queries
to uniformly sample images uploaded between 2006 and 2010. Based on our observation
that batch-tagged images, namely those labeled with the same tags by the same
user, tend to be near duplicates, we have excluded these images beforehand. Other than
this, we do not perform near-duplicate image removal. To accommodate methods that
cannot handle large data, we created two random subsets of the entire training set,
resulting in three training sets of varied sizes, termed Train10k, Train100k, and
Train1m, respectively.
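
The exclusion of batch-tagged images can be sketched as below, reading "batch-tagged" literally as images that share both the uploading user and the exact tag set. The record layout and the choice to drop every image of such a group, rather than keep one representative, are assumptions made for illustration.

```python
from collections import Counter


def remove_batch_tagged(images):
    """Drop images whose (user, tag set) pair occurs more than once.

    images: list of dicts with 'id', 'user' and 'tags' keys
            (hypothetical record layout).
    """
    def key(image):
        return (image["user"], frozenset(image["tags"]))

    counts = Counter(key(image) for image in images)
    return [image for image in images if counts[key(image)] == 1]


photos = [
    {"id": 1, "user": "u1", "tags": ["paris", "2009"]},
    {"id": 2, "user": "u1", "tags": ["2009", "paris"]},  # same user, same tags
    {"id": 3, "user": "u1", "tags": ["paris", "eiffel"]},
    {"id": 4, "user": "u2", "tags": ["paris", "2009"]},  # different user
]
kept = remove_batch_tagged(photos)
```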

Test media X. We use MIRFlickr [Huiskes et al. 2010] and NUS-WIDE [Chua et al.
2009] for tag assignment and refinement, as in [Verbeek et al. 2010; Zhu et al. 2010;
Uricchio et al. 2013] and [Tang et al. 2011; McAuley and Leskovec 2012; Zhu et al.
2010; Uricchio et al. 2013], respectively. We use NUS-WIDE for evaluating tag retrieval
as in [Sun et al. 2011; Li et al. 2011a]. In addition, for retrieval we collected another
test set, namely Flickr51, contributed by Microsoft Research Asia [Wang et al. 2010; Gao
et al. 2013]. The MIRFlickr set contains 25,000 images with ground truth available for
14 tags. The NUS-WIDE set contains 259,233 images, with ground truth available for
81 tags. The Flickr51 set consists of 81,541 Flickr images with partial ground truth
provided for 55 test tags. Among the 55 tags, there are 4 tags which either have zero
occurrence in our training data or have no correspondence in WordNet, so we ignore
them. Different from the binary judgments in NUS-WIDE, Flickr51 provides graded
relevance, with 0, 1, and 2 indicating irrelevant, relevant, and very relevant,
respectively. Moreover, the set contains several ambiguous tags such as ‘apple’ and ‘jaguar’,
where relevant instances could exhibit completely different imagery, e.g., Apple
computers versus fruit apples. Following the original intention of the datasets, we use
MIRFlickr and NUS-WIDE for evaluating tag assignment and tag refinement, and
Flickr51 and NUS-WIDE for tag retrieval. For all three test sets, we use the full
dataset for testing.

Although the training and test media are all from Flickr, they were collected
independently, and consequently they share a relatively small number of images,
as shown in Table III.


3.2. Implementation 

This section describes common implementations applicable to all the three tasks, in¬ 
cluding the choice of visual features and tag preprocessing. Implementations that are 
applied uniquely to single tasks will be described in the coming sections. 

Table III. Data overlap between Train1m and the three test sets, measured in terms of the number of shared
images, tags, and users, respectively. Tag overlap is counted on the top 1,000 most frequent tags. As the
original photo ids of MIRFlickr have been anonymized, we cannot check image overlap between this dataset
and Train1m.

                    Overlap with Train1m
Test media      # images     # tags    # users
MIRFlickr              -        693      6,515
Flickr51             730        538     14,211
NUS-WIDE           7,975        718     38,481

Visual features. Two types of features are extracted to provide insight into the
performance improvement achievable by appropriate feature selection: the classical bag
of visual words (BoVW) and the current state-of-the-art deep learning based features
extracted from Convolutional Neural Networks (CNN). The BoVW feature is extracted
by the color descriptor software [Van De Sande et al. 2010]. SIFT descriptors are
computed at densely sampled points, at every 6 pixels for two scales. A codebook of size 1,024
is created by K-means clustering. The SIFT descriptors are quantized by the codebook using hard
assignment, and aggregated by sum pooling. In addition, we extract a compact 64-d
global feature [Li 2007], combining a 44-d color correlogram, a 14-d texture moment,
and a 6-d RGB color moment, to complement the BoVW feature. The CNN feature is
extracted by the pre-trained VGGNet [Simonyan and Zisserman 2015]. In particular,
we adopt the 16-layer VGGNet, and take the ReLU activations of the last fully connected
layer as the feature vector, resulting in a feature vector of 4,096 dimensions per image.
The BoVW feature is used with the $l_1$ distance and the CNN feature is used with the
cosine distance for their good performance.

Vocabulary V. Since the set of tags a person may use is meant to be open, the need to
specify a tag vocabulary is merely an engineering convenience. For a tag to be
meaningfully modeled, there has to be a reasonable number of training images with respect
to that tag. For methods where tags are processed independently from the others, the
size of the vocabulary has no impact on the performance. In the other cases, in
particular for transductive methods that rely on the image-tag association matrix, the
tag dimension has to be constrained to make the methods runnable. In our case, for
these methods a three-step automatic cleaning procedure is performed on the training
datasets. First, all the tags are lemmatized to their base forms by the NLTK software
[Bird et al. 2009]. Second, tags not defined in WordNet are removed. Finally, in order to
avoid insufficient sampling, we remove tags that cannot meet a threshold on tag
occurrence. The thresholds are empirically set as 50, 250, and 750 for Train10k, Train100k,
and Train1m, respectively, in order to have a linear increase in vocabulary size versus
a logarithmic increase in the number of labeled images. This results in a final
vocabulary of 237, 419, and 1,549 tags, respectively, with all the test tags included. Note
that these numbers of tags are larger than the number of tags that can be actually
evaluated. This allows us to build a unified evaluation framework that is more convenient
for cross-dataset evaluation.
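
The three-step cleaning procedure can be sketched as follows. To keep the sketch self-contained, the lemmatizer and the dictionary are passed in as arguments and stand in for NLTK and WordNet; the toy lemmatizer, the lower-casing step, and the convention of counting a tag at most once per image are assumptions, not details taken from the procedure above.

```python
from collections import Counter


def build_vocabulary(image_tags, lemmatize, dictionary, min_occurrence):
    """Three-step vocabulary cleaning.

    image_tags:     iterable of tag lists, one list per training image.
    lemmatize:      callable mapping a tag to its base form (stand-in for
                    an NLTK lemmatizer).
    dictionary:     set of accepted words (stand-in for WordNet).
    min_occurrence: minimum number of images a tag must be assigned to.
    """
    counts = Counter()
    for tags in image_tags:
        # Step 1: lemmatize; a tag counts once per image (assumption).
        lemmas = {lemmatize(tag.lower()) for tag in tags}
        # Step 2: keep only tags defined in the dictionary.
        counts.update(lemma for lemma in lemmas if lemma in dictionary)
    # Step 3: drop tags below the occurrence threshold.
    return {tag for tag, count in counts.items() if count >= min_occurrence}


def toy_lemmatize(tag):
    """Hypothetical lemmatizer that only strips a trailing 's'."""
    return tag[:-1] if tag.endswith("s") else tag


vocabulary = build_vocabulary(
    [["dogs", "xyz123"], ["dog", "cat"], ["Dog"]],
    lemmatize=toy_lemmatize,
    dictionary={"dog", "cat"},
    min_occurrence=2,
)
```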


3.3. Evaluating tag assignment 

Evaluation criteria. A good method for tag assignment shall rank relevant tags before
irrelevant tags for a given test image. Moreover, with the assigned tags, relevant
images shall be ranked before irrelevant images for a given test tag. We therefore use
the image-centric Mean image Average Precision (MiAP) to measure the quality of tag
ranking, and the tag-centric Mean Average Precision (MAP) to measure the quality
of image ranking. Let $m_{gt}$ be the number of ground-truthed test tags, which is 14 for
MIRFlickr and 81 for NUS-WIDE. The image-centric Average Precision of a given test
image x is computed as

$$ iAP(x) := \frac{1}{R} \sum_{j=1}^{m_{gt}} \frac{r_j}{j} \, \delta(x, t_j), \tag{1} $$

where R is the number of relevant tags of the given image, $r_j$ is the number of relevant
tags in the top j ranked tags, and $\delta(x, t_j) = 1$ if tag $t_j$ is relevant and 0 otherwise.
MiAP is obtained by averaging iAP(x) over the test images.

The tag-centric Average Precision of a given test tag t is computed as

$$ AP(t) := \frac{1}{R} \sum_{i=1}^{n} \frac{r_i}{i} \, \delta(x_i, t), \tag{2} $$

where R is the number of relevant images for the given tag, and $r_i$ is the number of
relevant images in the top i ranked images. MAP is obtained by averaging AP(t) over
the test tags.
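
Both average precision variants share the same form and differ only in what is ranked: tags per image for iAP, images per tag for AP. A minimal sketch, with hypothetical function names, is given below. It assumes the ranked list covers all candidates, so that the number of relevant items R equals the number of relevant entries in the list, and it returns 0 when nothing is relevant, which is a convention chosen here.

```python
def average_precision(relevance):
    """Average precision of one ranked list.

    relevance: 0/1 flags in ranked order, best-ranked item first.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # convention assumed here for lists without relevant items
    hits = 0
    score = 0.0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1                 # relevant items seen in the top `position`
            score += hits / position  # precision at this position
    return score / total_relevant


def mean_average_precision(rankings):
    """Average over ranked lists: over images for MiAP, over tags for MAP."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


ap = average_precision([1, 0, 1, 0])  # (1/1 + 2/3) / 2
```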

The two metrics are complementary to some extent. Since MiAP is averaged over 
images, each test image contributes equally to MiAP, as opposed to MAP where each 
tag contributes equally. Consequently, MiAP is biased towards frequent tags, while 
MAP can be easily affected by the performance of rare tags, especially when $m_{gt}$ is
relatively small.

Baseline. Any method targeting tag assignment should outperform a random
guess, which simply returns a random set of tags. The RandomGuess baseline is
obtained by computing MiAP and MAP given the random prediction, which is run 100
times with the resulting scores averaged.

3.4. Evaluating tag refinement 

Evaluation criteria. As tag refinement is also meant for improving tag ranking and 
image ranking, it is evaluated by the same criteria, i.e., MiAP and MAP, as used for 
tag assignment. 

Baseline. A natural baseline for tag refinement is the original user tags assigned to 
an image, which we term as UserTags. 


3.5. Evaluating tag retrieval 

Evaluation criteria. To compare methods for tag retrieval, for each test tag we first
conduct tag-based image search to retrieve images labeled with that tag, and then sort
the images by the tag relevance scores. We use MAP to measure the quality of the
entire image ranking. As users often look at the top ranked results and hardly go through
the entire list, we also report Normalized Discounted Cumulative Gain (NDCG),
commonly used to evaluate the top few ranked results of an information retrieval system
[Jarvelin and Kekalainen 2002]. Given a test tag t, its NDCG at a particular rank
position h is defined as:

$$ NDCG_h(t) := \frac{DCG_h(t)}{IDCG_h(t)}, \tag{3} $$

where $DCG_h(t) = \sum_{i=1}^{h} \frac{2^{rel_i} - 1}{\log_2(i+1)}$, $rel_i$ is the graded relevance of the result at position
i, and $IDCG_h$ is the maximum possible DCG till position h. We set h to be 20, which
corresponds to a typical number of search results presented on the first two pages of a
web search engine. Similar to MAP, $NDCG_{20}$ of a specific method on a specific test set
is averaged over the test tags of that test set.
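
A minimal sketch of the NDCG computation follows. The gain term 2^rel - 1 and the base-2 logarithmic discount are assumptions on our part (they are the commonly used form), as is obtaining the ideal ranking by sorting the relevance grades of the same list; function names are hypothetical.

```python
import math


def dcg(relevances, h):
    """Discounted cumulative gain over the top h results.
    Gain 2^rel - 1 and discount log2(i + 1) are assumed here."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:h], start=1))


def ndcg(relevances, h=20):
    """NDCG at rank h for one test tag.

    relevances: graded relevance (e.g. 0, 1, 2) of the ranked images.
    The ideal DCG is computed from the same grades sorted in descending order.
    """
    ideal = dcg(sorted(relevances, reverse=True), h)
    return dcg(relevances, h) / ideal if ideal > 0 else 0.0


perfect = ndcg([2, 1, 0], h=3)
reversed_order = ndcg([0, 1, 2], h=3)
```

A perfectly ordered list scores 1, while the reversed list is penalized because the very relevant image only appears at the third position.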

Baselines. When searching for relevant images for a given tag, it is natural to ask
how much a specific method gains compared to a baseline system which simply returns
a random subset of images labeled with that tag. Similar to the refinement baseline,
we also denote this baseline as UserTags, as both of them purely use the original user
tags. For each test tag, the test images labeled with this tag are sorted at random, and
MAP and $NDCG_{20}$ are computed accordingly. The process is executed 100 times, and
the average score over the 100 runs is reported.

The number of tags per image is often included for image ranking in previous works
[Liu et al. 2009; Xu et al. 2009]. Hence, we build another baseline system, denoted as
TagNum, which sorts images in ascending order by the number of tags per image. The
third baseline, denoted as TagPosition, is from [Sun et al. 2011], where the relevance
score of a tag is determined by its position in the original tag list uploaded by the user.
More precisely, the score is computed as 1 − position(t)/l, where l is the number of tags.
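
The TagNum and TagPosition baselines are simple enough to sketch directly. Whether position(t) starts at 0 or 1 is not stated above; the sketch assumes 1-based positions, so the last tag in a list scores 0. The record layout and function names are hypothetical.

```python
def tag_position_score(tags, tag):
    """TagPosition: 1 - position(t) / l, with 1-based positions (assumption)."""
    position = tags.index(tag) + 1
    return 1.0 - position / len(tags)


def tag_num_ranking(images):
    """TagNum: sort images in ascending order of their number of tags.

    images: list of dicts with 'id' and 'tags' keys (hypothetical layout).
    """
    return sorted(images, key=lambda image: len(image["tags"]))


first = tag_position_score(["a", "b", "c", "d"], "a")
last = tag_position_score(["a", "b", "c", "d"], "d")
ranked = tag_num_ranking([
    {"id": 1, "tags": ["a", "b", "c"]},
    {"id": 2, "tags": ["a"]},
])
```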


4. METHODS SELECTED FOR COMPARISON 

Despite the rich literature, most works do not provide code. An exhaustive evaluation
covering all published methods is impractical. We have to leave out methods that do
not show significant improvements or novelties w.r.t. the seminal papers in the field,
and methods that are difficult to replicate with the same mathematical precision as
intended by their developers. We drive our choice by the intention to cover methods
that aim for each of the three tasks, exploiting varied modalities by distinct learning
mechanisms. Eventually we evaluate 11 representative methods. For each method we
analyze its scalability in terms of both computation and memory. Our analysis leaves
out operations that are independent of specific tags and thus only need to be executed
once in an offline manner, such as visual feature extraction, tag preprocessing, prior
information precomputing, and filtering. Main properties of the methods are summarized
in Table IV. Concerning the choices of parameters, we adopt what the original
papers recommend. When no recommendation is given for a specific method, we try a
range of values to our best understanding, and choose the parameters that yield the
best overall performance.


4.1. Methods under analysis 

1. SemanticField [Zhu et al. 2012]. This method measures tag relevance in terms of
an averaged semantic similarity between the tag and the other tags assigned to the
image:

$$ f_{SemField}(x, t) := \frac{1}{l_x} \sum_{i=1}^{l_x} sim(t, t_i), \tag{4} $$

where $\{t_1, \ldots, t_{l_x}\}$ is a list of $l_x$ social tags assigned to the image x, and $sim(t, t_i)$
denotes a semantic similarity between two tags. SemanticField explicitly assumes that
several tags are associated with the visual data and accounts for their coexistence in the
evaluation of tag relevance. Following [Zhu et al. 2012], the similarity is computed by
combining the Flickr context similarity and the WordNet Wu-Palmer similarity [Wu
and Palmer 1994]. The WordNet based similarity exploits path length in the WordNet
hierarchy to infer tag relatedness. We make a small revision of [Zhu et al. 2012],
i.e., combining the two similarities by averaging instead of multiplication, because
averaging produces slightly better results. SemanticField requires no training
except for computing tag-wise similarity, which can be computed offline and is thus
omitted. Having all tag-wise similarities in memory, applying Eq. (4) requires $l_x$ table
lookups per tag. Hence, the computational complexity is $O(m \cdot l_x)$, and $O(m^2)$ for
memory.
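
A minimal sketch of the SemanticField scoring of Eq. (4) is given below. The similarity is passed in as a function; the toy lookup table, its values, and the convention of returning 0 for an image without tags are assumptions made for illustration.

```python
def semantic_field(image_tags, tag, similarity):
    """Average similarity between `tag` and the tags assigned to an image.

    image_tags: list of social tags of the image.
    similarity: callable sim(t, t_i), e.g. a lookup into a precomputed table.
    """
    if not image_tags:
        return 0.0  # convention assumed here for untagged images
    return sum(similarity(tag, other) for other in image_tags) / len(image_tags)


# Toy symmetric similarity table (hypothetical values).
TABLE = {
    frozenset(["dog", "puppy"]): 0.8,
    frozenset(["dog", "park"]): 0.4,
}


def toy_similarity(a, b):
    """Table lookup; identical tags get 1, unknown pairs get 0."""
    if a == b:
        return 1.0
    return TABLE.get(frozenset([a, b]), 0.0)


score = semantic_field(["puppy", "park"], "dog", toy_similarity)
unrelated = semantic_field(["puppy", "park"], "car", toy_similarity)
```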


2. TagRanking [Liu et al. 2009]. The tag ranking algorithm consists of two steps.
Given an image $x$ and its tags, the first step produces an initial tag relevance score
for each of the tags, obtained by (Gaussian) kernel density estimation on a set of $n$ =
1,000 images labeled with each tag, separately. Secondly, a random walk is performed
on a tag graph where the edges are weighted by a tag-wise similarity. We use the
same similarity as in SemanticField. Notice that when applied for tag retrieval, the
algorithm uses the rank of $t$ instead of its score, i.e.,

$$f_{TagRanking}(x, t) = -rank(t) + \frac{1}{l_x}, \qquad (5)$$

where $rank(t)$ returns the rank of $t$ produced by the tag ranking algorithm. The term
$\frac{1}{l_x}$ is a tie-breaker when two images have the same tag rank. Hence, for a given
tag $t$, TagRanking cannot distinguish relevant images from irrelevant images if $t$ is the
sole tag assigned to them. It explicitly exploits the coexistence of several tags per image.
TagRanking has no learning stage. To derive tag ranks for Eq. (5), the main computation
is the kernel density estimation on $n$ socially-tagged examples for each tag, followed
by an $L$-iteration random walk on the tag graph of $m$ nodes. All this results in a
computation cost of $O(m \cdot d \cdot n + L \cdot m^2)$ per test image. Because the two steps
are executed sequentially, the corresponding memory cost is $O(\max(dn, m^2))$.
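
As a minimal sketch of how Eq. (5) turns a tag ranking into a retrieval score, consider the following Python function. The function name and the list-based input are illustrative assumptions rather than part of the original implementation, and ranks are assumed to start at 1.

```python
def tagranking_score(ranked_tags, tag):
    """Score an image for `tag` as -rank(tag) + 1 / l_x (Eq. 5).

    ranked_tags: the image's tags, ordered best-first by the tag ranking
    algorithm (1-based ranks are an assumption of this sketch).
    Returns None if the image is not labeled with the tag.
    """
    if tag not in ranked_tags:
        return None
    rank = ranked_tags.index(tag) + 1   # 1-based rank of the tag
    l_x = len(ranked_tags)              # number of tags on the image
    return -rank + 1.0 / l_x            # 1 / l_x only breaks ties


# 'dog' is ranked first in both images: the image with fewer tags wins the tie.
a = tagranking_score(["dog", "park"], "dog")
b = tagranking_score(["dog", "park", "2009", "nikon"], "dog")
# An image whose only tag is 'dog' always scores -1 + 1 = 0, relevant or not.
c = tagranking_score(["dog"], "dog")
```

The last call illustrates why the method cannot separate relevant from irrelevant images that carry the test tag as their sole tag: all of them receive the same score.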

3. KNN [Makadia et al. 2010]. This algorithm estimates the relevance of a given
tag with respect to an image by first retrieving $k$ nearest neighbors from $\mathcal{S}$
based on a visual distance $d$, and then counting the tag occurrence in associated tags of
the neighborhood. In particular, KNN builds $f_\Phi(x, t; \Theta)$ as:

$$f_{KNN}(x, t) := k_t, \qquad (6)$$

where $k_t$ is the number of images with $t$ in the visual neighborhood of $x$. The
instance-based KNN requires no training. The main computation of $f_{KNN}$ is to find $k$
nearest neighbors from $\mathcal{S}$, which has a complexity of
$O(d \cdot |\mathcal{S}| + k \cdot \log|\mathcal{S}|)$ per test image, and a memory
footprint of $O(d \cdot |\mathcal{S}|)$ to store all the $d$-dimensional feature vectors. It
is worth noting that these complexities are drawn from a straightforward implementation
of $k$-nn search, and can be substantially reduced by employing more efficient
search techniques, cf. [Jegou et al. 2011]. Accelerating KNN by the product quantization
technique [Jegou et al. 2011] imposes an extra training step, where one has
to construct multiple vector quantizers by K-means clustering, and further use the
quantizers to compress the original feature vector into a few codes.
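
The straightforward implementation referred to above can be sketched as a brute-force scan. This is a minimal sketch, not the code used in our experiments: the function name, the toy data, and the choice of squared Euclidean distance as the visual distance are all assumptions made for illustration.

```python
import heapq


def knn_tag_relevance(x, tag, train_set, k):
    """f_KNN(x, t) = k_t: how many of the k nearest neighbors carry `tag`.

    train_set: list of (feature_vector, set_of_tags) pairs standing in for S.
    Squared Euclidean distance is an assumption; any visual distance works.
    """
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Brute-force scan over S, then selection of the k smallest distances.
    neighbors = heapq.nsmallest(k, train_set, key=lambda item: dist(x, item[0]))
    return sum(1 for _, tags in neighbors if tag in tags)


# Hypothetical toy training set with 2-dimensional features.
S = [
    ([0.0, 0.0], {"beach", "sea"}),
    ([0.1, 0.0], {"beach"}),
    ([0.0, 0.2], {"sunset"}),
    ([5.0, 5.0], {"beach", "car"}),
]
k_beach = knn_tag_relevance([0.0, 0.1], "beach", S, k=3)
```

In this toy example the far-away fourth image does not vote, so two of the three neighbors support 'beach'.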

4. TagVote [Li et al. 2009b]. The TagVote algorithm estimates the relevance of a
tag $t$ w.r.t. an image $x$ by counting the occurrence frequency of $t$ in social annotations
of the visual neighbors of $x$. Different from KNN, TagVote exploits the user element,
introducing a unique-user constraint on the neighbor set to make the voting result
more objective. Each user has at most one image in the neighbor set. Moreover, TagVote
takes into account tag prior frequency to suppress overly frequent tags. In particular, the
TagVote algorithm builds $f_\Phi(x, t; \Theta)$ as

$$f_{TagVote}(x, t) := k_t - k \frac{n_t}{|\mathcal{S}|}, \qquad (7)$$

where $n_t$ is the number of images labeled with $t$ in $\mathcal{S}$. Following
[Li et al. 2009b], we set $k$ to be 1,000 for both KNN and TagVote. TagVote has the same
order of complexity as KNN.
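
The two ingredients that separate TagVote from KNN, the unique-user constraint and the tag prior, can be sketched as follows. This is a minimal sketch assuming the neighbors arrive already sorted by visual distance; the function name, argument names and toy data are hypothetical.

```python
def tagvote_relevance(neighbors, tag, k, n_t, train_size):
    """f_TagVote(x, t) = k_t - k * n_t / |S| with a unique-user constraint.

    neighbors: (user_id, set_of_tags) pairs sorted by increasing visual
    distance to the test image. Only the closest image of each user votes.
    n_t: number of training images labeled with `tag`; train_size: |S|.
    """
    seen_users = set()
    k_t = 0
    kept = 0
    for user, tags in neighbors:
        if user in seen_users:
            continue                  # unique-user constraint
        seen_users.add(user)
        kept += 1
        if tag in tags:
            k_t += 1
        if kept == k:
            break
    prior = k * n_t / train_size      # votes expected from a random neighbor set
    return k_t - prior


# Hypothetical neighbor list: user u1 batch-tagged two near-duplicate images.
neighbors = [
    ("u1", {"sky"}),
    ("u1", {"sky"}),
    ("u2", {"sky", "tree"}),
    ("u3", {"tree"}),
]
score = tagvote_relevance(neighbors, "sky", k=3, n_t=250, train_size=1000)
```

Here the second image of user u1 is skipped, so 'sky' collects two votes among three distinct users, and the prior of 0.75 is subtracted from that count.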

5. TagProp [Guillaumin et al. 2009; Verbeek et al. 2010]. TagProp employs neighbor
voting plus distance metric learning. A probabilistic framework is proposed where the
probability of using images in the neighborhood is defined based on rank or
distance-based weights. TagProp builds $f_\Phi(x, t; \Theta)$ as:

$$f_{TagProp}(x, t) := \sum_{j=1}^{k} \pi_j \cdot I(x_j, t), \qquad (8)$$

where $\pi_j$ is a non-negative weight indicating the importance of the $j$-th neighbor $x_j$,
and $I(x_j, t)$ returns 1 if $x_j$ is labeled with $t$, and 0 otherwise. Following [Verbeek et al.
2010], we use $k$ = 1,000 and the rank-based weights, which showed similar performance
to the distance-based weights. Different from TagVote, which uses tag prior to penalize
frequent tags, TagProp promotes rare tags and penalizes frequent ones by training
a logistic model per tag upon $f_{TagProp}(x, t)$. The use of the logistic model makes
TagProp a model-based method. In contrast to KNN and TagVote, wherein visual neighbors
are treated equally, TagProp employs distance metric learning to re-weight the neighbors,
yielding a learning complexity of $O(l \cdot m \cdot k)$ where $l$ is the number of gradient
descent iterations it needs (typically fewer than 10). TagProp maintains $2m$ extra
parameters for the logistic models, though their storage cost is negligible compared to
the visual features. Therefore, running Eq. (8) has the same order of complexity as
KNN and TagVote.
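
To make the weighted voting of Eq. (8) and the per-tag logistic model concrete, a minimal sketch follows. The weights are taken as given here (learning them is the part that needs gradient descent), and the names `alpha` and `beta` for the two per-tag logistic parameters are our own illustrative choice.

```python
import math


def tagprop_score(neighbor_tags, weights, tag):
    """Eq. (8): sum of pi_j * I(x_j, t) over the k neighbors.

    neighbor_tags: tag sets of the neighbors, nearest first.
    weights: non-negative pi_j, one per neighbor rank (assumed already learned).
    """
    return sum(w for w, tags in zip(weights, neighbor_tags) if tag in tags)


def logistic_calibration(score, alpha, beta):
    """Per-tag logistic model sigma(alpha * score + beta): two parameters per tag."""
    return 1.0 / (1.0 + math.exp(-(alpha * score + beta)))


# Hypothetical rank-based weights that decay with the neighbor rank.
weights = [0.5, 0.25, 0.25]
neighbor_tags = [{"cat"}, {"dog"}, {"cat", "dog"}]
cat_score = tagprop_score(neighbor_tags, weights, "cat")
dog_score = tagprop_score(neighbor_tags, weights, "dog")
p_cat = logistic_calibration(cat_score, alpha=2.0, beta=-1.0)
```

For a positive `alpha` the logistic function is monotonically increasing, so it rescales scores across tags without changing how images are ordered for any single tag.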

6. TagCooccur [Sigurbjornsson and Van Zwol 2008]. While both SemanticField and
TagCooccur are tag-based, the main difference lies in how they compute the contribution
of a specific tag to the test tag's relevance score. Different from SemanticField,
which uses tag similarities, TagCooccur uses the test tag's rank in the tag ranking list
created by sorting all tags in terms of their co-occurrence frequency with the tag in
$\mathcal{S}$. In addition, TagCooccur takes into account the stability of the tag, measured
by its frequency. The method is implemented as

$$f_{tagcooccur}(x, t) = descriptive(t) \sum_{i=1}^{l_x} vote(t_i, t) \cdot rank\text{-}promotion(t_i, t) \cdot stability(t_i), \qquad (9)$$

where $descriptive(t)$ damps the contribution of tags with a very high frequency,
$rank\text{-}promotion(t_i, t)$ measures the rank-based contribution of $t_i$ to $t$,
$stability(t_i)$ promotes tags for which the statistics are more stable, and $vote(t_i, t)$
is 1 if $t$ is among the top 25 ranked tags of $t_i$, and 0 otherwise. TagCooccur has the
same order of complexity as SemanticField.

7. TagCooccur+ [Li et al. 2009b]. TagCooccur+ is proposed to improve TagCooccur
by adding the visual content. This is achieved by multiplying $f_{tagcooccur}(x, t)$ with a
content-based term, i.e.,

$$f_{tagcooccur+}(x, t) = f_{tagcooccur}(x, t) \cdot \frac{k_c}{k_c + r_c(t) - 1}, \qquad (10)$$

where $r_c(t)$ is the rank of $t$ when sorting the vocabulary by $f_{TagVote}(x, t)$ in descending
order, and $k_c$ is a positive weighting parameter, which is empirically set to 1. While
TagCooccur+ is grounded on TagCooccur and TagVote, the complexity of the former
is negligible compared to the latter, so the complexity of TagCooccur+ is the same as
KNN.
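
The content-based re-weighting of Eq. (10) can be illustrated with a short sketch. It assumes the content-based term has the form $k_c / (k_c + r_c(t) - 1)$ and that TagVote scores for the whole vocabulary are available as a dictionary; the function name and toy values are hypothetical.

```python
def tagcooccur_plus(tagcooccur_score, tagvote_scores, tag, k_c=1.0):
    """Eq. (10): f_tagcooccur(x, t) * k_c / (k_c + r_c(t) - 1).

    tagvote_scores: dict mapping every vocabulary tag to f_TagVote(x, tag).
    r_c(t) is the 1-based rank of `tag` when sorting by TagVote, descending.
    The form of the content-based term is an assumption of this sketch.
    """
    ranking = sorted(tagvote_scores, key=tagvote_scores.get, reverse=True)
    r_c = ranking.index(tag) + 1
    return tagcooccur_score * k_c / (k_c + r_c - 1)


# Hypothetical TagVote scores of one test image over a three-tag vocabulary.
tagvote = {"sea": 4.0, "beach": 2.5, "car": -1.0}
top = tagcooccur_plus(0.8, tagvote, "sea")       # visually best tag keeps its score
second = tagcooccur_plus(0.8, tagvote, "beach")  # rank 2 halves the score when k_c = 1
```

With $k_c = 1$ the multiplier is simply the reciprocal of the visual rank, so a tag that co-occurs well but is visually implausible is pushed down quickly.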

8. TagFeature [Chen et al. 2012]. The basic idea is to enrich image features by
adding an extra tag feature. A tag vocabulary that consists of the $d'$ most frequent tags in
$\mathcal{S}$ is constructed first. Then, for each tag a two-class linear SVM classifier is
trained using LIBLINEAR [Fan et al. 2008]. The positive training set consists of $p$ images
labeled with the tag in $\mathcal{S}$, and the same amount of negative training examples are
randomly sampled from images not labeled with the tag. The probabilistic output of the
classifier, obtained by Platt's scaling [Lin et al. 2007], corresponds to a specific dimension
in the tag feature. By concatenating the tag and visual features, an augmented feature of
$d + d'$ dimensions is obtained. For a test tag $t$, its tag relevance function
$f_{TagFeature}(x, t)$ is obtained by re-training an SVM classifier using the augmented
feature. The linear property of the classifier allows us to first sum up all the support
vectors into a single vector and consequently to classify a test image by the inner product
with this vector. That is,

$$f_{TagFeature}(x, t) := b + \langle x_t, x \rangle, \qquad (11)$$

where $x_t$ is the weighted sum of all support vectors and $b$ the intercept. To build
meaningful classifiers, we use tags that have at least 100 positive examples. While $d'$ is
chosen to be 400 in [Chen et al. 2012], the two smaller training sets, namely Train10k
and Train100k, have 76 and 396 tags satisfying the above requirement. We empirically
set $p$ to 500, and do random down-sampling if the amount of images for a tag
exceeds this number. For TagFeature, learning a linear classifier for each tag from $p$
positive and $p$ negative examples requires $O((d + d')p)$ in computation and $O((d + d')p)$
in memory [Fan et al. 2008]. Running Eq. (11) for all the $m$ tags and $n$ images needs
$O(nm(d + d'))$ in computation and $O(m(d + d'))$ in memory.
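
The reason a linear classifier costs only one inner product per image and tag can be shown in a few lines. This is a minimal sketch: the support vectors, their signed weights and the intercept are made-up numbers, and the helper names are ours.

```python
def collapse_support_vectors(support_vectors, coefficients):
    """Fold a linear SVM into one vector x_t = sum_i c_i * s_i.

    coefficients: signed weights c_i (dual coefficient times class label).
    """
    dim = len(support_vectors[0])
    x_t = [0.0] * dim
    for c, s in zip(coefficients, support_vectors):
        for j in range(dim):
            x_t[j] += c * s[j]
    return x_t


def tagfeature_score(x_t, b, x):
    """Eq. (11): b + <x_t, x>, one inner product per test image and tag."""
    return b + sum(wj * xj for wj, xj in zip(x_t, x))


# Hypothetical classifier with two support vectors in a 3-dimensional space.
support_vectors = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
coefficients = [0.5, -0.25]
x_t = collapse_support_vectors(support_vectors, coefficients)
score = tagfeature_score(x_t, b=0.5, x=[2.0, 4.0, 1.0])
```

The collapsed vector is computed once per tag, which is why test-time memory grows with the number of tags and the feature dimension rather than with the number of support vectors.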

9. RelExample [Li and Snoek 2013]. Different from TagFeature [Chen et al. 2012],
which directly learns from tagged images, RelExample exploits positive and negative
training examples which are deemed to be more relevant with respect to the test tag
$t$. In particular, relevant positive examples are selected from $\mathcal{S}$ by combining
SemanticField and TagVote in a late fusion manner. For negative training example
acquisition, the authors leverage Negative Bootstrap [Li et al. 2013], a negative sampling
algorithm which iteratively selects negative examples deemed most relevant for improving
classification. A $T$-iteration Negative Bootstrap will produce $T$ meta classifiers. The
corresponding tag relevance function is written as

$$f_{RelExample}(x, t) := \frac{1}{T} \sum_{l=1}^{T} \Big( b_l + \sum_{j=1}^{n_l} \alpha_{l,j} \cdot y_{l,j} \cdot K(x, x_{l,j}) \Big), \qquad (12)$$

where $\alpha_{l,j}$ is a positive coefficient of support vector $x_{l,j}$, $y_{l,j} \in \{-1, 1\}$ is
the class label, and $n_l$ the number of support vectors in the $l$-th classifier. For the sake
of efficiency, the kernel function $K$ is instantiated with the fast intersection kernel
[Maji et al. 2008]. RelExample uses the same amount of positive training examples as
TagFeature. The number of iterations $T$ is empirically set to 10. For the SVM classifiers
used in TagFeature and RelExample, Platt's scaling [Lin et al. 2007] is employed to
convert prediction scores into probabilistic output. In RelExample, for each tag learning a
histogram intersection kernel SVM has a computation cost of $O(dp^2)$ per iteration, and
$O(Tdp^2)$ for $T$ iterations. By jointly using the fast intersection kernel with a quantization
factor of $q$ [Maji et al. 2008] and model compression [Li et al. 2013], an order of $O(dq)$
is needed to keep all learned meta classifiers in memory. Since learning a new classifier
needs a memory of $O(dp)$, the overall memory cost for training RelExample is
$O(dp + dq)$. For each tag, model compression is applied to its learned ensemble in
advance to running Eq. (12). As a consequence, the compressed classifier can be cached
in an order of $O(dq)$ and executed in an order of $O(d)$.
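
A direct, uncompressed evaluation of Eq. (12) can be sketched as below. Note that this is the naive form: it uses the plain histogram intersection kernel and loops over every support vector, whereas the fast intersection kernel and model compression mentioned above avoid exactly this loop. All names and numbers are illustrative assumptions.

```python
def intersection_kernel(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def relexample_score(x, classifiers):
    """Eq. (12): average decision value of the T meta classifiers.

    classifiers: list of (b, support) pairs, where support is a list of
    (alpha, y, support_vector) triples with alpha > 0 and y in {-1, +1}.
    """
    total = 0.0
    for b, support in classifiers:
        total += b + sum(alpha * y * intersection_kernel(x, sv)
                         for alpha, y, sv in support)
    return total / len(classifiers)


# Hypothetical ensemble of T = 2 meta classifiers over 2-bin histograms.
ensemble = [
    (0.0, [(1.0, +1, [1.0, 0.0]), (1.0, -1, [0.0, 0.25])]),
    (-0.25, [(2.0, +1, [0.25, 0.25])]),
]
score = relexample_score([0.5, 0.5], ensemble)
```

In this toy case the two meta classifiers output 0.25 and 0.75, and the ensemble score is their mean.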

10. RobustPCA [Zhu et al. 2010]. On the basis of robust principal component analysis
[Candes et al. 2011], RobustPCA factorizes the image-tag matrix $D$ by a low rank
decomposition with error sparsity. That is,

$$D = \hat{D} + E, \qquad (13)$$

where the reconstructed $\hat{D}$ has a low rank constraint based on the nuclear norm, and
$E$ is an error matrix with an $\ell_1$-norm sparsity constraint. Notice that the decomposition
is not unique. So for a better solution, the decomposition process takes into account
image affinities and tag affinities, by adding two extra penalties with respect to a
Laplacian matrix $L_i$ from the image affinity graph and another Laplacian matrix $L_t$ from
the tag affinity graph. Consequently, two hyper-parameters $\lambda_1$ and $\lambda_2$ are
introduced to balance the error sparsity and the two Laplacian strengths. We follow the
original paper and set the two parameters by performing a grid search on the very same
proposed range. To address the tag sparseness, the authors employ a preprocessing step
to refine $D$ by a weighted KNN propagation based on the visual similarity. RobustPCA
requires an iterative procedure based on the accelerated proximal gradient method
with a quadratic convergence rate [Zhu et al. 2010]. Each iteration spends the majority
of the required time performing Singular Value Decomposition that, according to
[Golub and Van Loan 2012], has a well known complexity of $O(c m^2 n + c' n^3)$ where
$c$, $c'$ are constants. Regarding memory, it has a requirement of
$O(c n \cdot m + c' \cdot (n^2 + m^2))$ as it needs to maintain a full copy of $D$ and
Laplacians of images and labels.
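
To give a feeling for how an $\ell_1$ sparsity constraint is handled inside a proximal gradient iteration, the sketch below shows element-wise soft-thresholding, the standard proximal operator of the $\ell_1$ norm. It is a generic illustration and not the RobustPCA implementation: the threshold value and the toy residual are assumptions, and the low-rank step (which thresholds singular values and dominates the cost) is omitted.

```python
def soft_threshold(matrix, tau):
    """Element-wise shrinkage: sign(v) * max(|v| - tau, 0)."""
    def shrink(v):
        if v > tau:
            return v - tau
        if v < -tau:
            return v + tau
        return 0.0
    return [[shrink(v) for v in row] for row in matrix]


# Hypothetical residual standing in for D minus the current low-rank estimate.
residual = [[1.0, 0.25], [-0.75, 0.0]]
E = soft_threshold(residual, tau=0.5)
```

Small residual entries are set exactly to zero, which is what keeps the error matrix sparse; only entries whose magnitude exceeds the threshold survive, shrunk towards zero.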

11. TensorAnalysis [Sang et al. 2012a]. This method considers ternary relationships
between images, tags and users, by extending the image-tag association matrix to
a binary user-image-tag tensor $F \in \{0, 1\}^{|\mathcal{X}| \times |\mathcal{V}| \times |\mathcal{U}|}$.
The tensor is factorized by Tucker decomposition into a dense core $C$ and three low rank
matrices $U$, $I$, $T$, corresponding to the user, image, and tag modalities, respectively:

$$F = C \times_u U \times_i I \times_t T. \qquad (14)$$

Here $\times_j$ is the tensor product between a tensor and a matrix along dimension
$j \in \{u, i, t\}$. The idea is that $C$ contains the interactions between modalities, while each
low-rank matrix represents the main components of each modality. Every modality
has to be sized manually or by energy retention, adding three needed parameters
$R = (r_I, r_T, r_U)$. The tag relevance scores are obtained by computing
$\hat{D} = C \times_i I \times_t T \times_u \mathbf{1}_{r_U}$.
Similar to RobustPCA, the decomposition in Eq. (14) is not unique and a better solution
may be found by regularizing the optimization process with a Laplacian built on a
similarity graph for each modality, i.e., $L_i$, $L_t$, and $L_u$, and an $\ell_2$ regularizer on each
factor, i.e., $C$, $U$, $I$ and $T$. For TensorAnalysis, the complexity is
$O(|P_1| \cdot (r_T \cdot m^2 + r_U \cdot r_I \cdot r_T))$, proportional to the number of tags
$P_1$ asserted in $D$ and the dimensions $r_U$, $r_I$, $r_T$ of the low rank factors. The memory
required is $O(n^2 + m^2 + u^2)$ for the Laplacians of images, tags and users.
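
The mode products in Eq. (14) can be spelled out with explicit loops. The following is a minimal sketch of a generic Tucker reconstruction on nested Python lists, meant only to show which indices are contracted; it says nothing about how the factors are learned, and the tiny factor matrices are made up.

```python
def tucker_reconstruct(core, U, I, T):
    """F[u][i][t] = sum_{a,b,c} core[a][b][c] * U[u][a] * I[i][b] * T[t][c].

    core has shape r_U x r_I x r_T; U, I, T have one row per user, image
    and tag, and r_U, r_I, r_T columns respectively.
    """
    r_u, r_i, r_t = len(core), len(core[0]), len(core[0][0])
    return [[[sum(core[a][b][c] * U[u][a] * I[i][b] * T[t][c]
                  for a in range(r_u) for b in range(r_i) for c in range(r_t))
              for t in range(len(T))]
             for i in range(len(I))]
            for u in range(len(U))]


# Hypothetical rank-(1, 1, 1) example: 2 users, 3 images, 2 tags.
core = [[[2.0]]]
U = [[1.0], [0.5]]
I = [[1.0], [2.0], [0.0]]
T = [[1.0], [3.0]]
F = tucker_reconstruct(core, U, I, T)
```

Each entry of the reconstructed tensor mixes one row from every factor matrix through the core, which is how the core encodes interactions between the three modalities.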

4.2. Considerations 

An overview of the methods analyzed is given in Table IV. Among them, SemanticField,
counting solely on the tag modality, has the best scalability with respect to both
computation and memory. Among the instance-based methods, TagRanking, which works on
selected subsets of $\mathcal{S}$ rather than the entire collection, has the lowest memory
requirement. When the number of tags to be modeled is substantially smaller than the size
of $\mathcal{S}$, the model-based methods require less memory and run faster in the test
stage, but at the expense of SVM model learning in the training stage. The two
transduction-based methods have limited scalability, and can operate only on a
small-sized $\mathcal{S}$.

5. EVALUATION 

This section presents our evaluation of the 11 methods according to their applicability
to the three tasks using the proposed experimental protocol, that is, KNN, TagVote,
TagProp, TagFeature and RelExample for tag assignment (Section 5.1), TagCooccur,
TagCooccur+, RobustPCA, and TensorAnalysis for tag refinement (Section 5.2), and
all for tag retrieval (Section 5.3). For TensorAnalysis we were able to evaluate only
tag refinement with BovW features on MIRFlickr with Train10k and Train100k. The
reason for this exception is that our implementation of TensorAnalysis performs worse
than the baseline. Consequently, the results of TensorAnalysis were kindly provided by
the authors in the form of tag ranks. Since the provided tag ranks cannot be converted
to image ranks, we could not compute MAP scores. A comparison between our Flickr
based training data and ImageNet is given in Section 5.4.

Table IV. Main properties of the eleven methods evaluated in this survey following the dimensions of Fig. 1. The computational
and memory complexity of each method is based on processing n test images and m test tags by exploiting the training set S.

| Methods | Test Media | Task | Auxiliary Component: Filter | Auxiliary Component: Precompute | Learning: Train Computation | Learning: Test Computation | Learning: Train Memory | Learning: Test Memory |
|---|---|---|---|---|---|---|---|---|
| Instance-based: | | | | | | | | |
| SemanticField | tag | Retrieval | WordNet | sim(t, t') | - | $O(n m l_x)$ | - | $O(m^2)$ |
| TagCooccur | tag | Refinement, Retrieval | - | Tag prior, Co-occurrence | - | $O(n m l_x)$ | - | $O(m^2)$ |
| TagRanking | tag + image | Retrieval | - | sim(t, t') | - | $O(n(mdn + Lm^2))$ | - | $O(\max(dn, m^2))$ |
| KNN | tag + image | Assignment, Retrieval | - | - | - | $O(n(d|\mathcal{S}| + k \log|\mathcal{S}|))$ | - | $O(d|\mathcal{S}|)$ |
| TagVote | tag + image | Assignment, Retrieval | - | Tag prior | - | $O(n(d|\mathcal{S}| + k \log|\mathcal{S}|))$ | - | $O(d|\mathcal{S}|)$ |
| TagCooccur+ | tag + image | Refinement, Retrieval | - | Tag prior, Co-occurrence | - | $O(n(d|\mathcal{S}| + k \log|\mathcal{S}|))$ | - | $O(d|\mathcal{S}|)$ |
| Model-based: | | | | | | | | |
| TagProp | tag + image | Assignment, Retrieval | - | - | $O(l \cdot m \cdot k)$ | $O(n(d|\mathcal{S}| + k \log|\mathcal{S}|))$ | $O(d|\mathcal{S}| + 2m)$ | $O(d|\mathcal{S}| + 2m)$ |
| TagFeature | tag + image | Assignment, Retrieval | - | Tag classifiers | $O(m(d + d')p)$ | $O(nm(d + d'))$ | $O((d + d')p)$ | $O(m(d + d'))$ |
| RelExample | tag + image | Assignment, Retrieval | SemField + TagVote | sim(t, t') | $O(mTdp^2)$ | $O(nmd)$ | $O(dp + dq)$ | $O(mdq)$ |
| Transduction-based: | | | | | | | | |
| RobustPCA | tag + image | Refinement, Retrieval | WordNet + KNN | $L_i$, $L_t$ | - | $O(cm^2n + c'n^3)$ | - | $O(cnm + c' \cdot (n^2 + m^2))$ |
| TensorAnalysis | tag + image + user | Refinement | Postag sets | $L_i$, $L_t$, $L_u$ | - | $O(|P_1| \cdot (r_T \cdot m^2 + r_U \cdot r_I \cdot r_T))$ | - | $O(n^2 + m^2 + u^2)$ |


5.1. Tag assignment 

Table V shows the tag assignment performance of KNN, TagVote, TagProp, TagFeature
and RelExample. Their superior performance against the RandomGuess baseline
shows that learning purely from social media is meaningful. TagVote and TagProp
are the two best performing methods on both test sets. Substituting CNN for BovW
consistently brings improvements for all methods.

In more detail, the following considerations hold. TagProp has higher MAP performance
than KNN and TagVote in almost all the cases under analysis. As discussed in
Section 4, TagProp is built upon KNN, but it weights the neighbor images by rank and
applies a logistic model per tag. Since the logistic model does not affect the image
ranking, the superior performance of TagProp should be ascribed to rank-based neighbor
weighting. A per-tag comparison on MIRFlickr is given in Fig. 2. TagProp is almost
always ahead of KNN and TagVote. Concerning TagVote and KNN, recall that their main
difference is that TagVote applies the unique-user constraint on the neighborhood and
employs tag prior as a penalty term. The fact that the training data contains no
batch-tagged images minimizes the influence of the unique-user constraint. While the
penalty term does not affect image ranking for a given tag, it affects tag ranking for a
given image. This explains why KNN and TagVote have mostly the same MAP. Also,
the higher MiAP of TagVote suggests that the tag prior based penalty is helpful for doing
tag assignment by neighbor voting.
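
The argument that a per-tag penalty leaves image ranking untouched but changes tag ranking can be checked on a toy example. The sketch below assumes that MAP is driven by the ranking of images per tag and MiAP by the ranking of tags per image; the vote counts and priors are made-up numbers and the helper names are ours.

```python
def average_precision(ranked_relevance):
    """AP of a ranked list of booleans (True marks a relevant item)."""
    hits, total = 0, 0.0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def rank_images(scores, tag):
    """Image ids sorted by decreasing score for one tag (drives MAP)."""
    return sorted(scores, key=lambda img: scores[img][tag], reverse=True)


def rank_tags(scores, image):
    """Tags sorted by decreasing score for one image (drives MiAP)."""
    return sorted(scores[image], key=scores[image].get, reverse=True)


# Toy neighbor votes k_t (KNN) and a per-tag prior penalty (TagVote).
knn = {"img1": {"sky": 5, "dog": 4}, "img2": {"sky": 7, "dog": 1}}
prior = {"sky": 3.0, "dog": 0.5}
tagvote = {img: {t: v - prior[t] for t, v in tags.items()}
           for img, tags in knn.items()}

same_image_order = rank_images(knn, "sky") == rank_images(tagvote, "sky")
knn_tags = rank_tags(knn, "img1")          # 'sky' before 'dog'
tagvote_tags = rank_tags(tagvote, "img1")  # 'dog' before 'sky'

# If 'dog' is the only ground-truth tag of img1, the penalty improves its AP.
truth = {"dog"}
ap_knn = average_precision([t in truth for t in knn_tags])
ap_tagvote = average_precision([t in truth for t in tagvote_tags])
```

Subtracting a constant per tag shifts all images by the same amount for that tag, so the image order is preserved, while within one image the frequent tag 'sky' loses more than the rarer tag 'dog'.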

We observe that RelExample has a better MAP than TagFeature in every case. The
absence of a filtering component makes TagFeature more likely to overfit to training
examples irrelevant to the test tags. For the other two model-based methods, the
overfit issue is alleviated by different strategies: RelExample employs a filtering
component to select more relevant training examples, while TagProp has fewer parameters
to tune.

Table V. Evaluating methods for tag assignment. Given the same feature, bold values indicate top performers
on individual test sets.

| Method | MIRFlickr Train10k | MIRFlickr Train100k | MIRFlickr Train1m | NUS-WIDE Train10k | NUS-WIDE Train100k | NUS-WIDE Train1m |
|---|---|---|---|---|---|---|
| MiAP scores: | | | | | | |
| RandomGuess | 0.147 | 0.147 | 0.147 | 0.061 | 0.061 | 0.061 |
| BovW + KNN | 0.232 | 0.286 | 0.312 | 0.171 | 0.217 | 0.248 |
| BovW + TagVote | 0.276 | 0.310 | 0.328 | 0.183 | 0.231 | 0.259 |
| BovW + TagProp | 0.276 | 0.299 | 0.314 | 0.230 | 0.249 | 0.268 |
| BovW + TagFeature | 0.278 | 0.294 | 0.298 | 0.244 | 0.221 | 0.214 |
| BovW + RelExample | 0.284 | 0.309 | 0.303 | 0.257 | 0.233 | 0.245 |
| CNN + KNN | 0.326 | 0.366 | 0.379 | 0.315 | 0.343 | 0.376 |
| CNN + TagVote | 0.355 | 0.378 | 0.389 | 0.340 | 0.370 | 0.396 |
| CNN + TagProp | 0.373 | 0.384 | 0.392 | 0.366 | 0.376 | 0.380 |
| CNN + TagFeature | 0.359 | 0.378 | 0.383 | 0.367 | 0.338 | 0.373 |
| CNN + RelExample | 0.309 | 0.385 | 0.373 | 0.365 | 0.354 | 0.388 |
| MAP scores: | | | | | | |
| RandomGuess | 0.072 | 0.072 | 0.072 | 0.023 | 0.023 | 0.023 |
| BovW + KNN | 0.231 | 0.282 | 0.336 | 0.094 | 0.139 | 0.185 |
| BovW + TagVote | 0.228 | 0.280 | 0.334 | 0.093 | 0.137 | 0.184 |
| BovW + TagProp | 0.245 | 0.293 | 0.342 | 0.102 | 0.149 | 0.193 |
| BovW + TagFeature | 0.200 | 0.199 | 0.201 | 0.090 | 0.096 | 0.098 |
| BovW + RelExample | 0.284 | 0.303 | 0.310 | 0.119 | 0.155 | 0.172 |
| CNN + KNN | 0.564 | 0.613 | 0.639 | 0.271 | 0.356 | 0.400 |
| CNN + TagVote | 0.561 | 0.613 | 0.638 | 0.257 | 0.358 | 0.402 |
| CNN + TagProp | 0.586 | 0.619 | 0.641 | 0.305 | 0.376 | 0.397 |
| CNN + TagFeature | 0.444 | 0.554 | 0.563 | 0.262 | 0.310 | 0.326 |
| CNN + RelExample | 0.538 | 0.603 | 0.584 | 0.300 | 0.346 | 0.373 |

[Figure 2: per-tag bar chart; x-axis: Average Precision.]

Fig. 2. Per-tag comparison of methods for tag assignment on MIRFlickr, trained on Train1m. The
colors identify the features used: blue for BovW, red for CNN. The test tags have been sorted in descending
order by the performance of CNN + TagProp.

A per-image comparison on NUS-WIDE is given in Fig. 3. The test images are put
into disjoint groups so that images within the same group have the same number of
ground truth tags. For each group, the area of the colored bars is proportional to the
number of images on which the corresponding methods score best. The first group, i.e.,
images containing only one ground-truth tag, has the most noticeable change as the
training set grows. There are 75,378 images in this group, and for 39% of the images,
their single label is 'person'. When Train1m is used, RelExample beats KNN, TagVote,
and TagProp for this frequent label. This explains the leading position of RelExample
in the first group. The result also confirms our earlier discussion in Section 3.3 that
MiAP is likely to be biased by frequent tags.




[Figure 3: panel "Train1m - NUS-WIDE"; legend: CNN + KNN, CNN + TagVote, CNN + TagProp,
CNN + TagFeature, CNN + RelExample; x-axis: number of ground truth tags (1 to 13).]

Fig. 3. Per-image comparison of methods for tag assignment on NUS-WIDE. Test images are
grouped in terms of their number of ground truth tags. The area of a colored bar is proportional to the
number of images on which the corresponding method scores best.


In summary, as long as enough training examples are provided, instance-based
methods are on par with model-based methods for tag assignment. Model-based methods
are more suited when the training data is of limited availability. However, they
are less resilient to noise, and consequently a proper filtering strategy for refining the
training data becomes essential.

5.2. Tag refinement 

Table VI shows the performance of different methods for tag refinement. We were
unable to complete the table. In particular, RobustPCA could not go over 350k images
due to its high demand in both CPU time and memory (see Table IV), while the results of
TensorAnalysis were provided by the authors only on MIRFlickr with Train10k, Train100k,
and the BovW feature.

RobustPCA outperforms the competitors on both test sets, when provided with the
CNN feature. Fig. 4 presents a per-tag comparison on MIRFlickr. RobustPCA has the
best scores for 9 out of the 14 tags with BovW, and wins all the tags when CNN is used.

Concerning the influence of the media dimension, the tag + image based methods
(RobustPCA and TagCooccur+) are in general better than the tag based method
(TagCooccur). As shown in Fig. 4, except for 3 out of 14 MIRFlickr test tags with BovW,
using the image media is beneficial. As in the tag assignment task, the use of the CNN
feature strongly improves the performance.

Concerning the learning methods, TensorAnalysis has the potential to leverage tag,
image, and user simultaneously. However, due to its relatively poor scalability, we
were able to run this method only with Train10k and Train100k on MIRFlickr. For
Train10k, TensorAnalysis yielded higher MiAP than RobustPCA, probably thanks to
its capability of modeling user correlations. It is outperformed by RobustPCA when
more training data is used.

Table VI. Evaluating methods for tag refinement. The asterisk (*) indicates results provided by the authors of the
corresponding methods, while the dash (-) means we were unable to produce results. Given the same feature,
bold values indicate top performers on individual test sets per performance metric.

| Method | MIRFlickr Train10k | MIRFlickr Train100k | MIRFlickr Train1m | NUS-WIDE Train10k | NUS-WIDE Train100k | NUS-WIDE Train1m |
|---|---|---|---|---|---|---|
| MiAP scores: | | | | | | |
| UserTags | 0.204 | 0.204 | 0.204 | 0.255 | 0.255 | 0.255 |
| TagCooccur | 0.213 | 0.242 | 0.253 | 0.269 | 0.305 | 0.317 |
| BovW + TagCooccur+ | 0.217 | 0.262 | 0.286 | 0.245 | 0.297 | 0.324 |
| BovW + RobustPCA | 0.271 | 0.310 | - | 0.332 | 0.323 | - |
| BovW + TensorAnalysis | *0.298 | *0.297 | - | - | - | - |
| CNN + TagCooccur+ | 0.234 | 0.277 | 0.310 | 0.305 | 0.359 | 0.387 |
| CNN + RobustPCA | 0.368 | 0.376 | - | 0.424 | 0.419 | - |
| CNN + TensorAnalysis | - | - | - | - | - | - |
| MAP scores: | | | | | | |
| UserTags | 0.263 | 0.263 | 0.263 | 0.338 | 0.338 | 0.338 |
| TagCooccur | 0.266 | 0.298 | 0.313 | 0.223 | 0.321 | 0.308 |
| BovW + TagCooccur+ | 0.294 | 0.343 | 0.377 | 0.231 | 0.345 | 0.353 |
| BovW + RobustPCA | 0.225 | 0.337 | - | 0.229 | 0.234 | - |
| BovW + TensorAnalysis | - | - | - | - | - | - |
| CNN + TagCooccur+ | 0.330 | 0.381 | 0.420 | 0.264 | 0.391 | 0.406 |
| CNN + RobustPCA | 0.566 | 0.627 | - | 0.439 | 0.440 | - |
| CNN + TensorAnalysis | - | - | - | - | - | - |

Fig. 4. Per-tag comparison of methods for tag refinement on MIRFlickr, trained on Train100k. The
colors identify the features used: blue for BovW, red for CNN. The test tags have been sorted in descending
order by the performance of CNN + RobustPCA.

As more training data is used, the performance of TagCooccur, TagCooccur+, and
RobustPCA on MIRFlickr consistently improves. Since these three methods rely on
data-driven tag affinity, image affinity, or tag and image affinity, a small set of 10k
images is generally inadequate to compute these affinities. The effect of increasing
the training set size is clearly visible if we compare scores corresponding to Train10k
and Train100k. The results on NUS-WIDE show some inconsistency. For TagCooccur,
MiAP improves from Train100k to Train1m, while MAP drops. This is presumably
due to the fact that in the experiments we used the parameters recommended in the
original paper, appropriately selected to optimize tag ranking. Hence, they might be
suboptimal for image ranking. BovW + RobustPCA scores a lower MAP than BovW
+ TagCooccur+. This is probably due to the fact that the low-rank matrix factorization
technique, while being able to jointly exploit tag and image information, is more
sensitive to the content-based representation.

ACM Computing Surveys, Vol. X, No. X, Article X, Publication date: March 2016.

[Figure 5: panels per training set, including "Train10k - NUS-WIDE"; x-axis: number of
ground truth tags (1 to 13).]

Fig. 5. Per-image comparison of methods for tag refinement on NUS-WIDE. Test images are
grouped in terms of their number of ground truth tags. The area of a colored bar is proportional to the
number of images on which the corresponding method scores best.

A per-image comparison is given in Fig. 5. As for tag assignment, the test images
have been grouped according to the number of ground truth tags associated. The size
of the colored areas is proportional to the number of images where the corresponding
method scores best. For the majority of test images, the three tag refinement methods
have higher average precision than UserTags. This means more relevant tags are
added, so the tags are refined. It should be noted that the success of tag refinement
depends much on the quality of the original tags assigned to the test images. Examples
are shown in Table VII: in row 6, although the tag 'earthquake' is irrelevant to the
image content, it is ranked at the top by RobustPCA. To what extent a tag refinement
method should rely on the existing tags remains a delicate question.

To summarize, the tag + image based methods outperform the tag based method 
for tag refinement. RobustPCA is the best, and improves as more training data is 
employed. Nonetheless, implementing RobustPCA is challenging for both computation 
and memory footprint. In contrast, TagCooccur+ is more scalable and it can learn from 
large-scale data. 


5.3. Tag retrieval 

Table VIII shows the performance of different methods for tag retrieval. Recall that
when retrieving images for a specific test tag, we consider only images that are labeled
with this tag. Hence, MAP scores here are higher than their counterpart in Table VI.


We start our analysis by comparing the three baselines, namely UserTags, TagNum,
and TagPosition, which retrieve images simply by the original tags. As can be noticed,
TagNum and TagPosition are more effective than UserTags. TagNum outperforms
TagPosition on Flickr51, while the latter has better scores on NUS-WIDE. The
effectiveness of such metadata-based features depends much on the dataset, and they are
unreliable for tag retrieval.

Table VII. Selected tag assignment and refinement results on NUS-WIDE. Visual feature: BovW. The top five ranked tags are
shown, with correct prediction marked by the bold italic font.

All the methods considered have higher MAP than the three baselines. All the methods
have better performance than the baselines on Flickr51, and performance increases
with the size of the training set. On NUS-WIDE, SemanticField, TagCooccur, and
TagRanking are less effective than TagPosition. We attribute this result to the fact
that, for these methods, the tag relevance functions favor images with fewer tags. As a
result, they exhibit similar performance and a similar dataset dependency.

Concerning the influence of the media dimension, the tag + image based methods
(KNN, TagVote, TagProp, TagCooccur+, TagFeature, RobustPCA, RelExample) are in
general better than the tag based methods (SemanticField and TagCooccur). Fig. 6
shows the per-tag retrieval performance on Flickr51. For 33 out of the 51 test tags,
RelExample exhibits average precision higher than 0.9. By examining the top retrieved
images, we observe that the results produced by tag + image based methods and tag
based methods are complementary to some extent. For example, consider 'military',
one of the test tags of NUS-WIDE. RelExample retrieves images with strong visual
patterns such as military vehicles, while SemanticField returns images of military
personnel. Since the visual content is ignored, the results of SemanticField tend to be
visually different, which makes it possible to handle tags with visual ambiguity. This fact
can be observed in Fig. 7, which shows the top 10 ranked images of 'jaguar' by
TagPosition, SemanticField, BovW + RelExample, and CNN + RelExample. Although their
results are all correct, RelExample finds jaguar-brand cars only, while SemanticField
covers both cars and animals. However, for a complete evaluation of the capability of
managing ambiguous tags, fine-grained ground truth beyond what we currently have
is required.

Concerning the learning methods, TagVote consistently performs well, as in the tag
assignment experiment. KNN is comparable to TagVote, for the reason we have
discussed in Section 5.1. Given the CNN feature, the two methods even outperform their
model-based variant TagProp. Similar to the tag refinement experiment, the effectiveness
of RobustPCA for tag retrieval is sensitive to the choice of visual features. While
BovW + RobustPCA is worse than the majority on Flickr51, the performance of CNN
+ RobustPCA is more stable, and it performs well. For TagFeature, its gain from using
larger training data is relatively limited due to the absence of denoising. In contrast,
RelExample, by jointly using SemanticField and TagVote in its denoising component,
is consistently better than TagFeature.

The performance of individual methods consistently improves as more training data
is used. As the size of the training set increases, the performance gap between the best
model-based method (RelExample) and the best instance-based method (TagVote)
reduces. This suggests that large-scale training data diminishes the advantage of
model-based methods against the relatively simple instance-based methods.

In summary, even though the performance of the methods evaluated varies over
datasets, common patterns have been observed. First, the more social data are used for
training, the better the performance obtained. Since the tag relevance functions are
learned purely from social data without any extra manual labeling, and social data are
increasingly growing, this result promises that better tag relevance functions can be
learned. Second, given small-scale training data, tag + image based methods that
conduct model-based learning with denoised training examples turn out to be the most
effective solution. This however comes at the price of reducing the visual diversity
in the retrieval results. Moreover, the advantage of model-based learning vanishes as
more training data and the CNN feature are used, and TagVote performs the best.

Table VIII. Evaluating methods for tag retrieval. Given the same feature, bold values indicate top
performers on individual test sets per performance metric.

Method                    Flickr51                           NUS-WIDE
                          Train10k   Train100k  Train1m      Train10k   Train100k  Train1m
MAP scores:
UserTags                  0.595      0.595      0.595        0.489      0.489      0.489
TagNum                    0.664      0.664      0.664        0.520      0.520      0.520
TagPosition               0.640      0.640      0.640        0.557      0.557      0.557
SemanticField             0.687      0.707      0.713        0.565      0.584      0.584
TagCooccur                0.625      0.679      0.704        0.534      0.576      0.588
BovW + TagCooccur+        0.640      0.732      0.764        0.560      0.622      0.643
BovW + TagRanking         0.685      0.686      0.708        0.557      0.574      0.578
BovW + KNN                0.678      0.742      0.770        0.587      0.632      0.658
BovW + TagVote            0.678      0.741      0.769        0.587      0.632      0.659
BovW + TagProp            0.671      0.748      0.772        0.585      0.636      0.657
BovW + TagFeature         0.689      0.726      0.737        0.589      0.602      0.606
BovW + RelExample         0.706      0.756      0.783        0.609      0.645      0.663
BovW + RobustPCA          0.697      0.701      -            0.650      0.650      -
BovW + TensorAnalysis     -          -          -            -          -          -
CNN + TagCooccur+         0.654      0.781      0.821        0.572      0.653      0.674
CNN + TagRanking          0.744      0.735      0.747        0.589      0.590      0.590
CNN + KNN                 0.811      0.859      0.880        0.683      0.722      0.734
CNN + TagVote             0.808      0.859      0.881        0.675      0.724      0.738
CNN + TagProp             0.824      0.867      0.879        0.689      0.727      0.731
CNN + TagFeature          0.827      0.853      0.859        0.675      0.700      0.703
CNN + RelExample          0.838      0.863      0.878        0.689      0.717      0.734
CNN + RobustPCA           0.811      0.839      -            0.725      0.726      -
CNN + TensorAnalysis      -          -          -            -          -          -
NDCG20 scores:
UserTags                  0.432      0.432      0.432        0.487      0.487      0.487
TagNum                    0.522      0.522      0.522        0.541      0.541      0.541
TagPosition               0.511      0.511      0.511        0.623      0.623      0.623
SemanticField             0.591      0.623      0.645        0.596      0.622      0.624
TagCooccur                0.482      0.527      0.631        0.529      0.602      0.614
BovW + TagCooccur+        0.503      0.625      0.686        0.590      0.681      0.734
BovW + TagRanking         0.530      0.568      0.571        0.557      0.572      0.572
BovW + KNN                0.577      0.699      0.756        0.638      0.734      0.799
BovW + TagVote            0.573      0.701      0.754        0.629      0.734      0.804
BovW + TagProp            0.570      0.715      0.759        0.666      0.750      0.809
BovW + TagFeature         0.547      0.626      0.646        0.622      0.615      0.618
BovW + RelExample         0.614      0.722      0.748        0.692      0.736      0.776
BovW + RobustPCA          0.549      0.548      -            0.768      0.781      -
BovW + TensorAnalysis     -          -          -            -          -          -
CNN + TagCooccur+         0.504      0.615      0.724        0.571      0.705      0.738
CNN + TagRanking          0.577      0.607      0.597        0.578      0.594      0.583
CNN + KNN                 0.709      0.830      0.897        0.773      0.832      0.863
CNN + TagVote             0.722      0.826      0.899        0.740      0.837      0.879
CNN + TagProp             0.768      0.857      0.865        0.764      0.839      0.845
CNN + TagFeature          0.755      0.813      0.818        0.704      0.807      0.787
CNN + RelExample          0.764      0.843      0.879        0.773      0.814      0.866
CNN + RobustPCA           0.733      0.821      -            0.865      0.862      -
CNN + TensorAnalysis      -          -          -            -          -          -

Fig. 6. Per-tag comparison between TagPosition, SemanticField, TagVote, TagProp, and RelExample
on Flickr51, with Train1m as the training set. The 51 test tags have been sorted in descending
order by the performance of RelExample.

5.4. Flickr versus ImageNet

To address the question of whether one should resort to an existing resource such as
ImageNet for tag relevance learning, this section presents an empirical comparison
between our Flickr based training data and ImageNet. A number of methods do not
work with ImageNet or require modifications. For instance, tag + image + user information
based methods must be able to remove their dependency on user information,
as such information is unavailable in ImageNet. Tag co-occurrence statistics are also
strongly limited, because an ImageNet example is annotated with a single label.
Because of these limitations, we evaluate only the two best performing methods, TagVote
and TagProp. TagProp can be directly used since it comes from classic image annotation,
while TagVote is slightly modified by removing the unique user constraint. The
CNN feature is used for its superior performance against the BovW feature.
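
To make the modification concrete, the following is a minimal sketch of neighbor voting with
an optional unique-user constraint. It assumes the usual formulation in which the relevance of
a tag is the number of votes from the k nearest neighbors minus the number of votes expected
from k random images. The function name, the Euclidean distance, and the toy data are our own
assumptions, not the implementation evaluated in this survey.

```python
import math


def tag_vote(query, train, tag, k, unique_user=True):
    """Neighbor-voting relevance of `tag` for the `query` feature vector.

    `train` is a list of (feature, tags, user_id) triples. With
    unique_user=True, at most one neighbor per user is allowed to vote.
    """
    ranked = sorted(train, key=lambda ex: math.dist(query, ex[0]))
    neighbors, seen_users = [], set()
    for _, tags, user in ranked:
        if unique_user:
            if user in seen_users:
                continue  # this user already contributed a neighbor
            seen_users.add(user)
        neighbors.append(tags)
        if len(neighbors) == k:
            break
    votes = sum(tag in tags for tags in neighbors)
    # Prior: number of votes expected from k images drawn at random.
    prior = k * sum(tag in tags for _, tags, _ in train) / len(train)
    return votes - prior


# Hypothetical toy data: three near-duplicate 'dog' images from one user.
toy_train = [
    ((0.0, 0.0), {"dog"}, "u1"),
    ((0.1, 0.0), {"dog"}, "u1"),
    ((0.2, 0.0), {"dog"}, "u1"),
    ((1.0, 0.0), {"cat"}, "u2"),
    ((1.1, 0.0), {"cat"}, "u3"),
]
print(tag_vote((0.0, 0.0), toy_train, "dog", k=3, unique_user=True))
print(tag_vote((0.0, 0.0), toy_train, "dog", k=3, unique_user=False))
```

When no user identifiers exist, as in ImageNet, every example would share the same empty
user and the constraint would admit only a single neighbor, which is why it has to be
switched off.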

To construct a customized subset of ImageNet that fits the three test sets, we take
ImageNet examples whose labels precisely match the test tags. Notice that some
test tags, e.g., ‘portrait’ and ‘night’, have no match, while some other tags, e.g., ‘car’ and
‘dog’, have more than one match. In particular, MIRFlickr has 2 missing tags, while
the number of missing tags on Flickr51 and NUS-WIDE is 9 and 15, respectively. For a fair
comparison these missing tags are excluded from the evaluation. Putting the remaining test
tags together, we obtain a subset of ImageNet, containing 166 labels and over 200k
images, termed ImageNet200k.
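
The matching step can be sketched as follows. This is an illustration under our own
assumptions: the label index is a plain dictionary from a label name to all label identifiers
carrying that exact name, and the identifiers shown are placeholders rather than real ImageNet
identifiers.

```python
def match_tags_to_labels(test_tags, label_index):
    """Split test tags into matched tags (tag -> label ids) and missing tags.

    `label_index` maps a label name to every label id with that exact name,
    so a tag such as 'car' may match more than one label.
    """
    matched, missing = {}, []
    for tag in test_tags:
        ids = label_index.get(tag, [])
        if ids:
            matched[tag] = list(ids)
        else:
            missing.append(tag)  # excluded from the evaluation
    return matched, missing


# Hypothetical toy index; the ids are placeholders.
toy_index = {
    "car": ["label_001", "label_002"],
    "dog": ["label_003", "label_004"],
    "sand": ["label_005"],
}
matched, missing = match_tags_to_labels(
    ["car", "dog", "sand", "portrait", "night"], toy_index
)
print(sorted(matched), missing)
```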

Fig. 7. Top 10 ranked images of ‘jaguar’, by (a) TagPosition, (b) SemanticField, (c) BovW +
RelExample, and (d) CNN + RelExample. Checkmarks (✓) indicate relevant results. While both RelExample
and SemanticField outperform the TagPosition baseline, the results of SemanticField show more diversity
for this ambiguous tag. The difference between (c) and (d) suggests that the results of RelExample can be
diversified by varying the visual feature in use.

The left half of Table IX shows the performance of tag assignment. TagVote/TagProp
trained on the ImageNet data are less effective than their counterparts trained on the
Flickr data. For a better understanding of the result, we employ the same visualization
technique as used in Section 5.1, i.e., grouping the test images in terms of the number
of their ground truth tags, and subsequently checking the performance per group. As
shown in Fig. 8, while ImageNet200k performs better on the first group, i.e., images
with a single relevant tag, it is outperformed by Train100k and Train1M on the other
groups. Owing to its single-label nature, ImageNet is less effective for assigning multiple
labels to an image.
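
The grouping can be sketched as below, assuming a per-image score (for instance, per-image
average precision) is available for the method trained on each data source. The function name
and the toy scores are hypothetical and serve only to show the bookkeeping.

```python
from collections import Counter, defaultdict


def wins_per_group(num_truth_tags, scores_by_source):
    """Count, per group of images sharing the same number of ground truth
    tags, how often each training source obtains the highest score.

    Ties go to the source listed first in `scores_by_source`.
    """
    wins = defaultdict(Counter)
    for i, n_tags in enumerate(num_truth_tags):
        best = max(scores_by_source, key=lambda src: scores_by_source[src][i])
        wins[n_tags][best] += 1
    return {n: dict(counts) for n, counts in sorted(wins.items())}


# Hypothetical per-image scores for five test images.
toy_num_tags = [1, 1, 2, 2, 3]
toy_scores = {
    "ImageNet200k": [0.9, 0.8, 0.4, 0.5, 0.3],
    "Train1M": [0.7, 0.85, 0.6, 0.7, 0.5],
}
print(wins_per_group(toy_num_tags, toy_scores))
```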

Table IX. Flickr versus ImageNet. Notice that the numbers on Train100k and Train1M are different from
Tables V and VIII due to the use of a reduced set of test tags. Bold values indicate top performers on a
specific test set per performance metric.

Fig. 8. Per-image comparison of TagVote/TagProp learned from different training datasets,
tested on NUS-WIDE. Test images are grouped in terms of the number of ground truth tags. Within each
group, the area of a colored bar is proportional to the number of images on which (the method derived
from) the corresponding training dataset scores the best. ImageNet200k is less effective for assigning
multiple labels to an image.

For tag retrieval, as shown in the right half of Table IX, TagVote/TagProp learned
from ImageNet200k in general have higher MAP and NDCG scores than their
counterparts learned from the Flickr data. By comparing the performance difference per
concept, we find that the gain is largely contributed by a relatively small number of
concepts. Consider for instance TagVote + ImageNet200k and TagVote + Train1M on
NUS-WIDE. The former outperforms the latter for 25 out of the 66 tested concepts. By
sorting the concepts according to their absolute performance gain, the top three
winning concepts of TagVote + ImageNet200k are ‘sand’, ‘garden’, and ‘rainbow’, with AP
gain of 0.391, 0.284, and 0.176, respectively. Here, the lower performance of TagVote
+ Train1M is largely due to the subjectiveness of social tagging. For instance, Flickr
images labeled with ‘sand’ tend to be much more diverse, showing a wide range of things
visually irrelevant to sand. Interestingly, the top three losing concepts of TagVote +
ImageNet200k are ‘running’, ‘valley’, and ‘building’, with AP loss of 0.150, 0.107, and
0.090, respectively. For these concepts, we observe that their ImageNet examples lack
diversity. For example, ‘running’ in ImageNet200k mostly shows a person running on a track.
In contrast, the subjectiveness of social tagging now has a positive effect on generating
diverse training examples.
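
The per-concept analysis can be sketched as follows. The sketch assumes the common
non-interpolated definition of average precision, normalized by the number of relevant items in
the ranked list. The relevance lists are made-up toy data, so the resulting gains are
illustrative and unrelated to the numbers reported above.

```python
def average_precision(ranked_relevance):
    """Non-interpolated AP of a ranked list of 0/1 relevance labels."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def rank_concepts_by_gain(ap_a, ap_b):
    """Sort concepts by AP(a) - AP(b), largest gain first."""
    gains = {concept: ap_a[concept] - ap_b[concept] for concept in ap_a}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings produced by two systems for two concepts.
toy_a = {"sand": [1, 1, 0, 1], "running": [0, 1, 0, 1]}
toy_b = {"sand": [0, 1, 1, 0], "running": [1, 1, 0, 0]}
ap_a = {c: average_precision(r) for c, r in toy_a.items()}
ap_b = {c: average_precision(r) for c, r in toy_b.items()}
for concept, gain in rank_concepts_by_gain(ap_a, ap_b):
    print(f"{concept}: {gain:+.3f}")
```

Concepts at the head of the sorted list are the winning concepts of system a, and those at the
tail are its losing concepts.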

In summary, for tag assignment, social media examples are a preferred resource of
training data. For tag retrieval, ImageNet yields better performance, yet the performance
gain is largely due to a few tags where social tagging is very noisy. In such a
case, controlled manual labeling seems indispensable. In contrast, with clever tag relevance
learning algorithms, social training data demonstrate competitive or even better
performance for many of the tested tags. Nevertheless, where the boundary between
the two cases is precisely located remains unexplored.

6. CONCLUSIONS AND PERSPECTIVES 
6.1. Concluding remarks 

This paper presents a survey on image tag assignment, refinement and retrieval, with
the hope of illustrating connections and differences between the many methods and
their applicability, and consequently helping the interested audience to either pick
up an existing method or devise a method of their own given the data at hand. As
the topics are being actively studied, inevitably this survey will miss some papers.
Nevertheless, it provides a unified view of many existing works, and consequently
eases the effort of placing future works in a proper context, both theoretically and
experimentally.

Based on the key observation that all works rely on tag relevance learning as the
common ingredient, existing works, which vary in terms of their methodologies and
target tasks, have been interpreted in a unified framework. Consequently, a
two-dimensional taxonomy has been developed, allowing us to structure the growing literature
in light of what information a specific method exploits and how the information is
leveraged in order to produce its tag relevance scores. Having established the common
ground between methods, a new experimental protocol has been introduced for
a head-to-head comparison between the state-of-the-art. A selected set of eleven
representative works were implemented and evaluated for tag assignment, refinement,
and/or retrieval. The evaluation justifies the state-of-the-art on the three tasks.

Concerning what media is essential for tag relevance learning, tag + image is
consistently found to be better than tag alone. While the joint use of tag, image, and user
information (via TensorAnalysis) demonstrates its potential on small-scale datasets, it
becomes computationally prohibitive as the dataset size increases to 100k and beyond.
Comparing the three learning strategies, instance-based and model-based methods
are found to be more reliable and scalable than their transduction-based counterparts.
As model-based methods are more sensitive to the quality of social image tagging, a
proper filtering strategy for refining the training media is crucial for their success.
Despite the leading performance of model-based methods on the small training dataset,
their performance gain over the instance-based alternatives diminishes as more training data
is used. Finally, the CNN feature used as a substitute for the BovW feature brings
considerable improvements for all the tasks.

Much progress has been made. Given the current test tag set, the best-performing
methods already outperform user-provided tags for tag assignment (MiAP of 0.392 versus
0.204 on MIRFlickr and 0.396 versus 0.255 on NUS-WIDE). Image retrieval using
learned tag relevance also yields more accurate results compared to image retrieval
using original tags (MAP of 0.881 versus 0.595 on Flickr51 and 0.738 versus 0.489
on NUS-WIDE). For tag assignment and tag retrieval, methods that exploit tag + image
media by instance-based learning take the leading position. In particular, for tag
assignment, TagProp and TagVote perform best. For tag retrieval, TagVote achieves
the best overall performance. Methods that exploit tag + image by transduction-based
learning are more suited for tag refinement. RobustPCA is the choice for this task.
These baselines need to be compared against when one advocates a new method.

6.2. Reflections on future work 

Much remains to be done. Several exciting recent developments open up new
opportunities for the future. First, employing novel deep learning based visual features is
likely to boost the performance of the tag + image based methods. What is scientifically
more interesting is to devise a learning strategy that is capable of jointly exploiting
tag, image, and user information in a much more scalable manner than currently
feasible. The importance of the filter component, which refines socially tagged training
examples in advance of learning, is underestimated. As denoising often comes with the
price of reducing visual diversity, more research attention is required to understand
what an acceptable level of noise shall be for learning tag relevance. With a number
of collaboratively labeled resources publicly available, research on joint exploration of
social data and these resources is important. This connects to the most fundamental
aspect of content-based image retrieval in the context of sharing and tagging within
social media platforms: to what extent a social tag can be trusted remains open. Image
retrieval by multi-tag query is another important yet largely unexplored problem. For
a query of two tags, it is suggested to view the two tags as a single bi-gram tag [Li et al.
2012; Nie et al. 2012; Borth et al. 2013], which is found to be superior to late fusion of
individual tag scores. Nonetheless, due to the increasing sparseness of n-grams, how
to effectively answer generic queries of more than two tags is challenging. Test tags
in the current benchmark sets were picked based on availability. It would be relevant
to study what motivates people to search images on social media platforms and how
the search is conducted. We have not seen any quantitative study in this direction.
Last but not least, fine-grained ground truth that enables us to evaluate various tag
relevance learning methods for answering ambiguous tags is currently missing.
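
The sparseness argument can be illustrated with a small sketch. It counts how many training
images carry every tag of a query, and contrasts this with a simple late fusion that averages
per-tag relevance scores. Averaging is only one possible fusion rule and is our assumption;
the cited works may fuse scores differently. All data below are hypothetical.

```python
def ngram_support(query_tags, tagged_images):
    """Number of training images whose tag set contains every query tag."""
    query = set(query_tags)
    return sum(query <= tags for tags in tagged_images)


def late_fusion(query_tags, tag_scores):
    """Average the relevance scores of the individual query tags for one image."""
    return sum(tag_scores[tag] for tag in query_tags) / len(query_tags)


# Hypothetical tag sets of five training images.
toy_images = [
    {"dog", "beach", "sunset"},
    {"dog", "beach"},
    {"dog"},
    {"beach", "sunset"},
    {"dog", "park"},
]
for query in (["dog"], ["dog", "beach"], ["dog", "beach", "sunset"]):
    print(query, ngram_support(query, toy_images))

# Late fusion needs only per-tag scores, so it is unaffected by n-gram sparseness.
print(late_fusion(["dog", "beach"], {"dog": 0.8, "beach": 0.4}))
```

In the toy data the number of supporting images drops from 4 to 2 to 1 as the query grows from
one to three tags, which is the effect that makes longer n-gram queries hard to learn.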


“One way to resolve the semantic gap comes from sources outside the image ...”,
Smeulders et al. wrote at the end of their seminal paper [Smeulders et al. 2000]. While
what such sources would be was mostly unknown at that time, it is now becoming evident
that the many images shared and tagged in social media platforms are a promising
means to resolve the semantic gap. By adding new relevant tags, refining the existing ones
or directly addressing retrieval, the access to the semantics of the visual content has
been much improved. This is achieved only when appropriate care is taken to attack
the unreliability of social tagging.